[57] ABSTRACT

A method of instruction dispatch is provided in which a 
directly-decoded instruction and a microcode instruction are
concurrently dispatched ("packed"). The instruction which
is second in program order is retained until the succeeding 
clock cycle. During the succeeding clock cycle, a microcode 
unit determines if the microcode instruction and the directly- 
decoded instruction, when taken together, occupy less than 
or equal to the total number of issue positions available in 
the microprocessor. If the microcode unit determines that 
less than or equal to the total number of issue positions are
occupied, then the packing is successful. If the microcode 
unit determines that greater than the total number of issue 
positions are occupied, then the packing is unsuccessful and
the retained instruction is redispatched. Additionally, 
instruction dispatch selection is performed in two phases. 
First, a number of instructions are selected as potentially 
dispatchable instructions. From the potentially dispatchable
instructions, a set of actually dispatched instructions may be
selected based upon the success or failure of instruction 
packing during the previous clock cycle and whether or not 
packing was performed. If instruction packing was not 
performed during the previous clock cycle or was performed 
unsuccessfully, then the instructions which are foremost in 
program order within the potentially dispatchable instruc- 
tions are selected. However, if instruction packing was 
successfully performed in the previous clock cycle, then the 
retained instruction is not selected for dispatch. 

19 Claims, 12 Drawing Sheets

METHOD FOR CONCURRENTLY
DISPATCHING MICROCODE AND 
DIRECTLY-DECODED INSTRUCTIONS IN A 
MICROPROCESSOR 

This application is a Continuation of U.S. Ser. No.
08/878^28, filed on Jun. 18, 1997, now U.S. Pat. No.
5384,058, which is a File Wrapper Continuation of U.S. Ser.
No. 08/685,656, filed Jul. 24, 1996, now abandoned.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to the field of microprocessors and, 
more particularly, to instruction dispatch mechanisms within
microprocessors. 

2. Description of the Relevant Art 

Superscalar microprocessors achieve high performance 
by executing multiple instructions per clock cycle and by 
choosing the shortest possible clock cycle consistent with 
the design. As used herein, the term "clock cycle" refers to 
an interval of time accorded to various stages of an instruc- 
tion processing pipeline within the microprocessor. Storage 
devices (e.g. registers and arrays) capture their values 
according to the clock cycle. For example, a storage device 
may capture a value according to a rising or falling edge of
a clock signal defining the clock cycle. The storage device 
then stores the value until the subsequent rising or falling 
edge of the clock signal, respectively. The term "instruction
processing pipeline" is used herein to refer to the logic
circuits employed to process instructions in a pipelined 
fashion. Although the pipeline may be divided into any 
number of stages at which portions of instruction processing 
are performed, instruction processing generally comprises 
fetching the instruction, decoding the instruction, executing 
the instruction, and storing the execution results in the 
destination identified by the instruction. 

Microprocessor designers often design their products in 
accordance with the x86 microprocessor architecture in 
order to take advantage of its widespread acceptance in the 
computer industry. Because the x86 microprocessor
architecture is pervasive, many computer programs are written in
accordance with the architecture. X86 compatible micropro- 
cessors may execute these computer programs, thereby 
becoming more attractive to computer system designers who 
desire x86-capable computer systems. Such computer sys- 
tems are often well received within the industry due to the 
wide range of available computer programs. 

The x86 microprocessor architecture specifies a variable 
length instruction set (i.e. an instruction set in which various 
instructions employ differing numbers of bytes to specify 
that instruction). For example, the 80386 and later versions 
of x86 microprocessors employ between 1 and 15 bytes to 
specify a particular instruction. Instructions have an opcode, 
which may be 1-2 bytes, and additional bytes may be added 
to specify addressing modes, operands, and additional 
details regarding the instruction to be executed. Certain 
instructions within the x86 instruction set are quite complex,
specifying multiple operations to be performed. For 
example, the PUSHA instruction specifies that each of the 
x86 registers be pushed onto a stack defined by the value in 
the ESP register. The corresponding operations are a store 
operation for each register, and decrements of the ESP 
register between each store operation to generate the address 
for the next store operation. 
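
The PUSHA decomposition described above can be sketched in Python as follows. This is only an illustration: the operation names `DEC_ESP` and `STORE`, the tuple format, and the helper functions are hypothetical and do not come from the specification. The register order is the one the x86 architecture defines for PUSHA, the value stored for ESP is the value it held before the instruction began, and each store is modeled as preceded by a decrement of ESP.

```python
# Hypothetical operation names; a real microcode ROM encodes these differently.
PUSHA_ORDER = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def expand_pusha(operand_size=4):
    """Return the list of simpler operations for one PUSHA."""
    ops = []
    for reg in PUSHA_ORDER:
        ops.append(("DEC_ESP", operand_size))
        # The ESP slot receives the value ESP held before PUSHA started.
        source = "ORIGINAL_ESP" if reg == "ESP" else reg
        ops.append(("STORE", source))
    return ops


def run(ops, regs):
    """Apply the operations to a copy of `regs`; return (regs, memory)."""
    regs = dict(regs)
    original_esp = regs["ESP"]
    memory = {}
    for kind, arg in ops:
        if kind == "DEC_ESP":
            regs["ESP"] -= arg
        else:
            value = original_esp if arg == "ORIGINAL_ESP" else regs[arg]
            memory[regs["ESP"]] = value
    return regs, memory
```

With 32-bit operands, one PUSHA expands to eight stores and eight decrements in this model, which is far more than the small number of issue positions a superscalar design typically provides per clock cycle.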

Often, complex instructions are classified as microcode
instructions. Microcode instructions are transmitted to a
microcode unit within the microprocessor, which decodes
the complex microcode instruction and produces two or
more simpler instructions for execution by the
microprocessor. The simpler instructions corresponding to the
microcode instruction are typically stored in a read-only
memory (ROM) within the microcode unit. The microcode unit
determines an address within the ROM at which the simpler
instructions are stored, and transfers the instructions out of
the ROM beginning at that address. Multiple clock cycles
may be used to transfer the entire set of instructions
corresponding to the microcode instruction. Each microcode
instruction may correspond to a particular number of simpler
instructions dissimilar from the number of simpler
instructions corresponding to other microcode instructions.
Additionally, the number of simpler instructions
corresponding to a particular microcode instruction may vary
according to the addressing mode of the instruction, the
operand values, and/or the options included with the
instruction. The microcode unit issues the simpler
instructions into the instruction processing pipeline of the
microprocessor. The simpler instructions are thereafter
executed in a similar fashion to other instructions. It is
noted that the simpler instructions may be instructions
defined within the instruction set, or may be custom
instructions defined for the particular microprocessor.

Conversely, less complex instructions are decoded by
hardware decode units within the microprocessor, without
intervention by the microcode unit. The term
"directly-decoded instruction" will be used herein to refer to
instructions which are decoded and executed by the
microprocessor without the aid of a microcode unit. As opposed
to microcode instructions which are reduced to simpler
instructions which may be handled by the microprocessor,
directly-decoded instructions are decoded and executed via
hardware decode and functional units included within the
microprocessor.

Unfortunately, having microcode instructions which 
translate to an arbitrary number of simpler instructions 
creates numerous problems for dispatching multiple
instructions per clock cycle. Because the number of translated
instructions is not known at the time of transmitting the
microcode instruction to the microcode unit, instructions are
typically not concurrently dispatched with the microcode
instruction. Instead, the microcode instruction is typically
dispatched alone, and subsequent dispatch is typically
stalled until the simpler instructions corresponding to the
microcode instructions have been dispatched. For cases in
which the microcode instruction corresponds to a number of
instructions less than the maximum number of instructions
which may be dispatched during a clock cycle, dispatch
bandwidth (i.e. the number of concurrently dispatched
instructions) is wasted. Performance of the microprocessor
may thereby be deleteriously reduced from the level
achievable when dispatch bandwidth is fully utilized.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by 
a method of instruction dispatch as described herein. 
According to the method, a directly-decoded instruction and
a microcode instruction are concurrently dispatched
("packed"). The instruction which is second in program
order is retained until the succeeding clock cycle. During the
succeeding clock cycle, a microcode unit determines if the
microcode instruction and the directly-decoded instruction,
when taken together, occupy less than or equal to the total
number of issue positions available in the microprocessor. If
the microcode unit determines that less than or equal to the
total number of issue positions are occupied, then the
packing is successful. If the microcode unit determines that
greater than the total number of issue positions are occupied,
then the packing is unsuccessful and the retained instruction
is redispatched. Advantageously, instruction dispatch
bandwidth is increased when packing is successful. Performance
of a microprocessor employing the instruction dispatch
method may be beneficially increased due to the increased
utilization of processor resources as compared to
microprocessors which issue microcode instructions without
concurrent dispatch of other instructions. In other words, a
larger average number of instructions executed per clock cycle
may be achieved for instruction code which includes
microcode instructions which do not occupy each of the
available issue positions.
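
The packing check described above may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Instr` record, its field names, and both function names are hypothetical, and the default of three issue positions follows the embodiment of FIG. 1.

```python
from dataclasses import dataclass

ISSUE_POSITIONS = 3  # assumption: three issue positions, as in FIG. 1


@dataclass
class Instr:
    name: str
    is_microcode: bool
    positions: int  # issue positions occupied; for microcode this is
                    # only known in the clock cycle after dispatch


def packing_succeeded(first, second, total=ISSUE_POSITIONS):
    """True if the packed pair fits within the available issue positions."""
    return first.positions + second.positions <= total


def resolve_packing(first, second, total=ISSUE_POSITIONS):
    """Return the instruction to redispatch, or None if packing succeeded.

    `second` is the instruction that is second in program order and
    was retained after the packed dispatch.
    """
    if packing_succeeded(first, second, total):
        return None
    return second
```

For instance, a microcode instruction occupying two positions packs successfully with a directly-decoded instruction occupying one, whereas a microcode instruction occupying three positions forces redispatch of whichever instruction was second in program order.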

Additionally, instruction dispatch selection is performed 
in two phases. First, a number of instructions are selected as 
potentially dispatchable instructions. The number selected 
may be larger than the total number of available issue 
positions within the microprocessor. From the potentially
dispatchable instructions, a set of actually dispatched
instructions may be selected based upon the success or
failure of instruction packing during the previous clock
cycle. If instruction packing was not performed during the
previous clock cycle, then the instructions which are
foremost in program order within the potentially dispatchable
instructions are selected. Similarly, if instruction packing
was unsuccessfully performed during the previous clock
cycle, the instructions which are foremost in program order
within the potentially dispatchable instructions are selected.
In this case, the first instruction in program order is the
retained instruction. However, if instruction packing was
successfully performed in the previous clock cycle, then the
retained instruction is not selected for dispatch.
Advantageously, instructions may be concurrently
dispatched with the redispatch of the retained instruction when
instruction packing is unsuccessful. Instruction dispatch
bandwidth is not sacrificed when redispatch of a previously
dispatched instruction becomes necessary. A microprocessor
designed in accordance with the method may thereby
achieve a maximal average number of instructions
dispatched per clock cycle.
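
The two-phase selection described above may be illustrated with the following Python sketch. The function names and the list-based queue are hypothetical, and the sketch assumes for simplicity that every candidate occupies exactly one issue position.

```python
def select_potential(queue, retained, width):
    """Phase 1: gather candidates in program order.

    The retained instruction (if any) is placed first, so one more
    candidate than `width` may be gathered.
    """
    candidates = ([retained] if retained is not None else []) + list(queue)
    limit = width + (1 if retained is not None else 0)
    return candidates[:limit]


def select_actual(candidates, retained, packing_ok, width):
    """Phase 2: choose the instructions actually dispatched this cycle."""
    if retained is not None and packing_ok:
        # Packing succeeded, so the retained instruction is skipped.
        candidates = candidates[1:]
    # Otherwise the instructions foremost in program order are chosen;
    # after failed packing, the first of these is the retained one.
    return candidates[:width]
```

Because phase 1 gathers one candidate beyond the number of issue positions, phase 2 can fill every issue position whether or not the retained instruction must be redispatched.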

Broadly speaking, the present invention contemplates a 
method for dispatching instructions in a microprocessor 
having a plurality of issue positions comprising multiple
steps. A first instruction and a second instruction are
dispatched during a first clock cycle. The first instruction
precedes the second instruction in program order, and one of
the first instruction and the second instruction is a microcode
instruction. The other one of the first instruction and the
second instruction is a directly-decoded instruction. The
second instruction and a third instruction are selected for
dispatch during a second clock cycle subsequent to the first
clock cycle. If the microcode instruction and the
directly-decoded instruction, when dispatched together, occupy a
first number of issue positions greater than a total number of
the plurality of issue positions, the second instruction is
dispatched. Alternatively, the third instruction is dispatched
during the second clock cycle if the microcode instruction
and the directly-decoded instruction, when dispatched
together, occupy a second number of issue positions less
than or equal to the total number of the plurality of issue
positions.

The present invention further contemplates a method of 
dispatching instructions to a plurality of issue positions in a
microprocessor comprising multiple steps. A state bit is set
upon dispatch of a first instruction and a second instruction,
wherein one of the first instruction and the second
instruction is a microcode instruction and the other is a
directly-decoded instruction. The second instruction is
redispatched upon receipt of an indication that the microcode
instruction occupies a first number of the plurality of issue
positions which, when added to a second number of issue
positions occupied by the directly-decoded instruction,
exceeds a total number of the plurality of issue positions.
The redispatch occurs if the state bit is set.

The present invention still further contemplates a method 
for concurrently dispatching a microcode instruction and a 
directly-decoded instruction to a plurality of issue positions 
comprising several steps. A first plurality of instructions is 
dispatched during a first clock cycle, and includes the 
microcode instruction and the directly-decoded instruction. 
A second plurality of instructions is selected for dispatch 
during a second clock cycle subsequent to the first clock 
cycle. The second plurality of instructions is greater in 
number than a total number of the plurality of issue 
positions, and includes at least one of the first plurality of 
instructions. From the second plurality of instructions, at
least one instruction is dispatched during the second clock 
cycle. The instruction is one of the first plurality of instruc- 
tions if the microcode instruction is determined to occupy a 
first number of issue positions which, when added to a 
second number of issue positions occupied by a remainder 
of the first plurality of instructions, is greater than the total 
number of the plurality of issue positions. The instruction is 
not one of the first plurality of instructions if the first number 
added to the second number is less than or equal to the total 
number of issue positions. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Other objects and advantages of the invention will 
become apparent upon reading the following detailed 
description and upon reference to the accompanying draw- 
ings in which: 

FIG. 1 is a block diagram of one embodiment of a 
superscalar microprocessor. 

FIG. 2 is a block diagram of one embodiment of a pair of 
decode units shown in FIG. 1. 

FIG. 3 is a diagram depicting a portion of an instruction 
processing pipeline employed by one embodiment of the 
microprocessor shown in FIG. 1. 

FIG. 4 is a block diagram of one embodiment of an 
instruction cache and an instruction alignment unit shown in 
FIG. 1. 

FIG. 5 is a flowchart illustrating operation of the instruc- 
tion alignment unit shown in FIG. 4 according to one 
embodiment of the instruction alignment unit. 

FIG. 6 is a logic diagram of one embodiment of a 
multiplex to issue unit shown in FIG. 4. 

FIG. 7 is a table depicting combinations of instructions 
which may be stored in a byte queue depicted in FIG. 4, 
according to one embodiment of the byte queue.

FIG. 8 is a table depicting instructions analyzed by a
selection control unit shown in FIG. 4, according to one 
embodiment of the selection control unit. 

FIG. 9 is a table of issue position combinations which are 
selected by the selection control unit shown in FIG. 4, 
according to one embodiment of the selection control unit. 

FIG. 10 is an example of instruction selection.

FIG. 11 is a second example of instruction selection.

FIG. 12 is a block diagram of a computer system including
the microprocessor shown in FIG. 1.

While the invention is susceptible to various modifications
and alternative forms, specific embodiments thereof are
shown by way of example in the drawings and will herein be
described in detail. It should be understood, however, that
the drawings and detailed description thereto are not
intended to limit the invention to the particular form
disclosed, but on the contrary, the intention is to cover all
modifications, equivalents and alternatives falling within the
spirit and scope of the present invention as defined by the
appended claims.

DETAILED DESCRIPTION OF THE INVENTION

Turning now to FIG. 1, a block diagram of one embodiment
of a microprocessor 10 is shown. Microprocessor 10
includes a prefetch/predecode unit 12, a branch prediction
unit 14, an instruction cache 16, an instruction alignment
unit 18, a plurality of decode units 20A-20C, a plurality of
reservation stations 22A-22C, a plurality of functional units
24A-24C, a load/store unit 26, a data cache 28, a register file
30, a reorder buffer 32, and an MROM unit 34. Blocks
referred to herein with a reference number followed by a
letter will be collectively referred to by the reference number
alone. For example, decode units 20A-20C will be
collectively referred to as decode units 20.

Prefetch/predecode unit 12 is coupled to receive
instructions from a main memory subsystem (not shown), and is
further coupled to instruction cache 16 and branch
prediction unit 14. Similarly, branch prediction unit 14 is coupled
to instruction cache 16. Still further, branch prediction unit
14 is coupled to decode units 20 and functional units 24.
Instruction cache 16 is further coupled to MROM unit 34
and instruction alignment unit 18. Instruction alignment unit
18 is in turn coupled to decode units 20. Each decode unit
20A-20C is coupled to load/store unit 26 and to respective
reservation stations 22A-22C. Reservation stations
22A-22C are further coupled to respective functional units
24A-24C. Additionally, decode units 20 and reservation
stations 22 are coupled to register file 30 and reorder buffer
32. Functional units 24 are coupled to load/store unit 26,
register file 30, and reorder buffer 32 as well. Data cache 28
is coupled to load/store unit 26 and to the main memory
subsystem. Finally, MROM unit 34 is coupled to decode
units 20.

Generally speaking, microprocessor 10 categorizes
microcode instructions as either double dispatch or arbitrary
dispatch. Arbitrary dispatch microcode instructions may be
dispatched to any number of issue positions, and are
therefore dispatched without other instructions. In one particular
embodiment, two subclasses of arbitrary dispatch are
included: triple dispatch and more than triple dispatch.
Conversely, double dispatch instructions occupy a pair of
issue positions (i.e. double dispatch microcode instructions
are parsed into a pair of simpler instructions). In the
embodiment of FIG. 1, microprocessor 10 includes three issue
positions. A double dispatch instruction does not occupy one
of the issue positions. Therefore, a directly-decoded
instruction may be dispatched concurrently with the double
dispatch instructions. The directly-decoded instruction may be
immediately prior to or immediately following the double
dispatch instruction in program order. Advantageously,
dispatch bandwidth is not wasted for cases in which double
dispatch microcode instructions are encountered. Instead,
the remaining issue position is filled with a directly-decoded
instruction.

MROM unit 34 may detect double dispatch microcode
instructions subsequent to the clock cycle in which
instructions are selected for dispatch. Instructions are selected for
dispatch based upon predecode information, which identifies
microcode instructions as opposed to directly-decoded
instructions but does not identify double dispatch versus
arbitrary dispatch microcode instructions. The instruction
dispatch selection logic within microprocessor 10 assumes
that an MROM instruction selected for dispatch is a
double-dispatch instruction, and therefore selects a directly-decoded
instruction for concurrent dispatch. If MROM unit 34
detects that the microcode instruction is not a double
dispatch instruction, then either the microcode instruction or
the directly-decoded instruction is redispatched during the
following clock cycle (whichever one is second in program
order). If the microcode instruction is redispatched, it is
redispatched alone. If the directly-decoded instruction is
redispatched, it may be redispatched along with subsequent
instructions. Advantageously, even though microcode
instructions are not classified until after dispatch, concurrent
dispatch of microcode and directly-decoded instructions
may be accomplished. Still further, subsequent dispatch
bandwidth may not be wasted since redispatched
directly-decoded instructions may be dispatched concurrently with
subsequent instructions.

Microprocessor 10 is configured to align instructions from
instruction cache 16 to decode units 20 using instruction
alignment unit 18. Instructions are fetched as an aligned
plurality of bytes from a cache line within instruction cache
16. Instructions of interest may be stored beginning at any
arbitrary byte within the fetched bytes. For example, a
branch instruction may be executed having a target address
which lies within a cache line. The instructions of interest
therefore begin at the byte identified by the target address of
the branch instruction. From the instruction bytes fetched,
instruction alignment unit 18 identifies the instructions to be
executed. Instruction alignment unit 18 conveys the
instructions, in predicted program order, to decode units 20
for decode and execution.

Instruction alignment unit 18 includes a byte queue
configured to store instruction bytes. An instruction scanning
unit within instruction cache 16 separates the instructions
fetched into instruction blocks. Each instruction block
comprises a predefined number of instruction bytes. The
instruction scanning unit identifies up to a predefined maximum
number of instructions within the instruction block.
Instruction identification information for each of the identified
instructions is conveyed to instruction alignment unit 18 and
is stored in the byte queue. The instruction identification
information includes an indication of the validity of the
instruction, as well as indications of the start and end of the
instruction within the predefined number of instruction
bytes. In one embodiment, the predefined number of
instruction bytes comprises eight instruction bytes stored in
contiguous main memory storage locations. The eight
instruction bytes are aligned to an eight byte boundary (i.e. the least
significant three bits of the address of the first of the
contiguous bytes are zero). If more than the maximum
number of instructions are contained within a particular
predefined number of instruction bytes, the instruction bytes
are scanned again during a subsequent clock cycle. The
same instruction bytes are conveyed as another instruction
block, with the additional instructions within the instruction
bytes identified by the accompanying instruction
identification information. Therefore, an instruction block may be
defined as up to a predefined maximum number of
instructions contained within a predefined number of instruction
bytes.

The byte queue stores each instruction block and
corresponding instruction identification information within a
subqueue defined therein. The subqueues include a position for
each possible valid instruction within the instruction block.
The positions store instruction identification information
and are maintained such that the instruction identification
information for the first valid instruction within the
subqueue is stored in a first position within the subqueue,
instruction identification information regarding the second
valid instruction (in program order) is stored in a second
position within the subqueue, etc. When instructions within
the subqueue are dispatched, instruction identification
information corresponding to subsequent instructions is shifted
within the positions of the subqueue such that the first of the
remaining instructions is stored in the first position.
Advantageously, instruction alignment unit 18 may only
consider the instruction information stored in the first
position of each subqueue to detect the instruction to be
dispatched to decode unit 20A. Similarly, only the second
position of the first subqueue (the subqueue storing
instructions prior to the instructions stored in the other
subqueues in program order) may be considered for dispatch of
instructions to decode unit 20B. By managing the subqueues in
this manner, logic for selecting and aligning instructions may
be simplified. Fewer cascaded levels of logic may be employed
for performing the selection and alignment process,
allowing for high frequency implementation of microprocessor
10.
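
The shifting of positions within a subqueue may be sketched in Python as follows. The function name and the use of `None` to mark an invalid position are assumptions made for illustration, and the sketch does not model the special handling of overflow instructions in the last position.

```python
def dispatch_from_subqueue(positions, count):
    """Remove the first `count` valid entries and shift the rest forward.

    `positions` is a fixed-length list in program order; None marks an
    invalid (empty) position. The returned list has the same length.
    """
    remaining = [p for p in positions if p is not None][count:]
    return remaining + [None] * (len(positions) - len(remaining))
```

After each dispatch the first remaining instruction again sits in the first position, so selection logic only ever needs to examine fixed positions.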

Because instructions are variable length, an instruction 
may begin within a particular instruction block but end in 
another instruction block. Instructions beginning within a 
particular instruction block and ending in another instruction
block are referred to as "overflow instructions". The
subqueue storing the instruction block within which an overflow
instruction begins uses the last position to store the overflow
instruction's identification information. Unlike the other
positions, the instruction identification information of the
last position is not shifted from the last position when an
overflow instruction is stored therein. Advantageously,
instruction alignment unit 18 need only search the last
position of a particular subqueue to identify an instruction
overflowing from one subqueue to another.
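
Whether an instruction is an overflow instruction can be decided from its start and end byte addresses, as the following Python sketch suggests. The function name is hypothetical, and the eight-byte block size is that of the one embodiment described above; other block sizes may be passed in.

```python
BLOCK_BYTES = 8  # one embodiment uses eight-byte instruction blocks


def is_overflow(start, end, block_bytes=BLOCK_BYTES):
    """True if the instruction starts in one block and ends in another.

    `start` and `end` are the byte addresses of the instruction's
    first and last bytes.
    """
    return start // block_bytes != end // block_bytes
```

An instruction occupying bytes 6 through 9, for example, begins in the block covering bytes 0 to 7 and ends in the next block, so its identification information would occupy the last position of the first subqueue.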

As used herein, the term queue refers to a storage device 
for storing a plurality of data items. The data items are stored
with an ordered relationship between them. For example, the
data items of the byte queue are instructions. The ordered
relationship between the instructions is the program order of
the instructions. Data items are removed from the queue
according to the ordered relationship in a first in-first out
(FIFO) fashion. Additionally, the term shifting is used to
refer to movement of data items within the queue. When a
data item is shifted from a first storage location to a second
storage location, the data item is copied from the first storage
location to the second storage location and invalidated in the
first storage location. The invalidation may occur by
shifting yet another data item into the first storage
location, or by resetting a valid indication in the first
storage location.

Instruction cache 16 is a high speed cache memory 
provided to store instructions. Instructions are fetched from 
instruction cache 16 and dispatched to decode units 20. In 
one embodiment, instruction cache 16 is configured to store
up to 32 kilobytes of instructions in an 8 way set associative
structure having 32 byte lines (a byte comprises 8 binary
bits). Instruction cache 16 may additionally employ a way
prediction scheme in order to speed access times to the
instruction cache. Instead of accessing tags identifying each
line of instructions and comparing the tags to the fetch
address to select a way, instruction cache 16 predicts the way
that is accessed. In this manner, the way is selected prior to
accessing the instruction storage. The access time of
instruction cache 16 may be similar to a direct-mapped cache. A
tag comparison is performed and, if the way prediction is
incorrect, the correct instructions are fetched and the
incorrect instructions are discarded. It is noted that instruction
cache 16 may be implemented as a fully associative, set
associative, or direct mapped configuration.
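
The geometry implied by the 32 kilobyte, 8 way, 32 byte line embodiment can be worked out with the short Python sketch below. The function names are hypothetical, a kilobyte is taken as 1024 bytes, and the 32-bit address width is an assumption that the text above does not state.

```python
def cache_geometry(size_bytes, ways, line_bytes, address_bits=32):
    """Derive line count, set count, and address field widths.

    Assumes all sizes are powers of two.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the set count
    tag_bits = address_bits - index_bits - offset_bits
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


def split_address(addr, line_bytes, sets):
    """Split a fetch address into (tag, set index, byte offset)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset
```

Under these assumptions the cache holds 1024 lines in 128 sets, so a fetch address selects a set with 7 index bits, and way prediction chooses one of the 8 lines in that set before the tag comparison completes.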

Instructions are fetched from main memory and stored 
into instruction cache 16 by prefetch/predecode unit 12.
Instructions may be prefetched prior to instruction cache 16
recording a miss for the instructions in accordance with a
prefetch scheme. A variety of prefetch schemes may be
employed by prefetch/predecode unit 12. As
prefetch/predecode unit 12 transfers instructions from main memory
to instruction cache 16, prefetch/predecode unit 12 generates
three predecode bits for each byte of the instructions: a start
bit, an end bit, and a functional bit. The predecode bits form
tags indicative of the boundaries of each instruction. The
predecode tags may also convey additional information such
as whether a given instruction can be decoded directly by
decode units 20 or whether the instruction is executed by
invoking a microcode procedure controlled by MROM unit
34, as will be described in greater detail below. Still further,
prefetch/predecode unit 12 may be configured to detect
branch instructions and to store branch prediction
information corresponding to the branch instructions into branch
prediction unit 14. 

One encoding of the predecode tags for an embodiment of 
microprocessor 10 employing the x86 instruction set will
next be described. If a given byte is the first byte of an
instruction, the start bit for that byte is set. If the byte is the
last byte of an instruction, the end bit for that byte is set. For
this embodiment of microprocessor 10, instructions which
may be directly decoded by decode units 20 are referred to
as "fast path" instructions. Fast path instructions may be an
example of directly-decoded instructions for this
embodiment. The remaining x86 instructions are referred to as
MROM instructions, according to one embodiment. For this
embodiment, MROM instructions are an example of
microcode instructions.

For fast path instructions, the functional bit is set for each 
prefix byte included in the instruction, and cleared for other 
bytes. Alternatively, for MROM instructions, the functional 
bit is cleared for each prefix byte and set for other bytes. The 
type of instruction may be determined by examining the
functional bit corresponding to the end byte. If that
functional bit is clear, the instruction is a fast path instruction.
Conversely, if that functional bit is set, the instruction is an 
MROM instruction. The opcode of an instruction may 
thereby be located within an instruction which may be 
directly decoded by decode units 20 as the byte associated 
with the first clear functional bit in the instruction. For 
example, a fast path instruction including two prefix bytes, 
a Mod R/M byte, and an SIB byte would have start, end, and 
functional bits as follows: 



Start bits         10000
End bits           00001
Functional bits    11000
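
The predecode encoding above may be reproduced with the following Python sketch. The function names and the bit-string representation are hypothetical, and the sketch assumes that all prefix bytes precede the opcode byte.

```python
def predecode_bits(prefix_count, length, is_mrom):
    """Return (start, end, functional) bit strings, one bit per byte."""
    start = "".join("1" if i == 0 else "0" for i in range(length))
    end = "".join("1" if i == length - 1 else "0" for i in range(length))
    functional = ""
    for i in range(length):
        is_prefix = i < prefix_count
        # Fast path: bit set for prefix bytes, clear otherwise.
        # MROM: bit clear for prefix bytes, set otherwise.
        functional += "1" if is_prefix != is_mrom else "0"
    return start, end, functional


def classify(end, functional):
    """Classify an instruction by the functional bit of its end byte."""
    return "MROM" if functional[end.index("1")] == "1" else "fast path"


def opcode_index(functional):
    """Fast path only: the opcode is the first byte with a clear bit."""
    return functional.index("0")
```

Calling `predecode_bits(2, 5, False)` models the five-byte fast path example above (two prefix bytes, the opcode, a Mod R/M byte, and an SIB byte) and yields the same three bit patterns, with the opcode located at byte index 2.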



MROM instructions are instructions which are deter- 
mined to be too complex for decode by decode units 20. 
MROM instmctions are executed by invoking MROM unit 
34. More specifically, when an MROM instruction is 
encountered, MROM unit 34 parses and issues the instruction
into a subset of defined fast path instructions to effectuate
the desired operation. MROM unit 34 dispatches the
subset of fast path instructions to decode units 20. A listing
of exemplary x86 instructions categorized as fast path
instructions will be provided further below.

Microprocessor 10 employs branch prediction in order to
speculatively fetch instructions subsequent to conditional
branch instructions. Branch prediction unit 14 is included to
perform branch prediction operations. In one embodiment,
up to two branch target addresses are stored with respect to
each cache line in instruction cache 16. Prefetch/predecode
unit 12 determines initial branch targets when a particular
line is predecoded. Subsequent updates to the branch targets
corresponding to a cache line may occur due to the execution
of instructions within the cache line. Instruction cache 16
provides an indication of the instruction address being
fetched, so that branch prediction unit 14 may determine
which branch target addresses to select for forming a branch
prediction. Decode units 20 and functional units 24 provide
update information to branch prediction unit 14. Because
branch prediction unit 14 stores two targets per cache line,
some branch instructions within the line may not be stored
in branch prediction unit 14. Decode units 20 detect branch
instructions which were not predicted by branch prediction
unit 14. Functional units 24 execute the branch instructions
and determine if the predicted branch direction is incorrect.
The branch direction may be "taken", in which subsequent
instructions are fetched from the target address of the branch
instruction. Conversely, the branch direction may be "not
taken", in which subsequent instructions are fetched from
memory locations consecutive to the branch instruction.
When a mispredicted branch instruction is detected, instructions
subsequent to the mispredicted branch are discarded
from the various units of microprocessor 10. A variety of
suitable branch prediction algorithms may be employed by
branch prediction unit 14.

Instructions fetched from instruction cache 16 are conveyed
to instruction alignment unit 18. As instructions are
fetched from instruction cache 16, the corresponding predecode
data is scanned to provide information to instruction
alignment unit 18 (and to MROM unit 34) regarding the
instructions being fetched. Instruction alignment unit 18
utilizes the scanning data to align an instruction to each of
decode units 20. In one embodiment, instruction alignment
unit 18 aligns instructions from three sets of eight instruction
bytes to decode units 20. Decode unit 20A receives an
instruction which is prior to instructions concurrently
received by decode units 20B and 20C (in program order).
Similarly, decode unit 20B receives an instruction which is
prior to the instruction concurrently received by decode unit
20C in program order. As used herein, the term "program
order" refers to the order of the instruction as coded in the
original sequence in memory. The program order of instructions
is the order in which the instructions would be
executed upon a microprocessor which fetches, decodes,
executes, and writes the result of a particular instruction
prior to fetching another instruction. Additionally, the term
"dispatch" is used to refer to conveyance of an instruction to
an issue position which is to execute the instruction. Issue
positions may also dispatch load/store memory operations to
load/store unit 26.

Decode units 20 are configured to decode instructions
received from instruction alignment unit 18. Register operand
information is detected and routed to register file 30 and
reorder buffer 32. Additionally, if the instructions require
one or more memory operations to be performed, decode
units 20 dispatch the memory operations to load/store unit
26. Each instruction is decoded into a set of control values
for functional units 24, and these control values are dispatched
to reservation stations 22 along with operand
address information and displacement or immediate data
which may be included with the instruction.

Microprocessor 10 supports out of order execution, and
thus employs reorder buffer 32 to keep track of the original
program sequence for register read and write operations, to
implement register renaming, to allow for speculative
instruction execution and branch misprediction recovery,
and to facilitate precise exceptions. A temporary storage
location within reorder buffer 32 is reserved upon decode of
an instruction that involves the update of a register to
thereby store speculative register states. If a branch prediction
is incorrect, the results of speculatively-executed
instructions along the mispredicted path can be invalidated
in the buffer before they are written to register file 30.
Similarly, if a particular instruction causes an exception,
instructions subsequent to the particular instruction may be
discarded. In this manner, exceptions are "precise" (i.e.
instructions subsequent to the particular instruction causing
the exception are not completed prior to the exception). It is
noted that a particular instruction is speculatively executed
if it is executed prior to instructions which precede the
particular instruction in program order. Preceding instructions
may be a branch instruction or an exception-causing
instruction, in which case the speculative results may be
discarded by reorder buffer 32.

The instruction control values and immediate or displacement
data provided at the outputs of decode units 20 are
routed directly to respective reservation stations 22. In one
embodiment, each reservation station 22 is capable of holding
instruction information (i.e., instruction control values as
well as operand values, operand tags and/or immediate data)
for up to three pending instructions awaiting issue to the
corresponding functional unit. It is noted that for the
embodiment of FIG. 1, each reservation station 22 is associated
with a dedicated functional unit 24. Accordingly,
three dedicated "issue positions" are formed by reservation
stations 22 and functional units 24. In other words, issue
position 0 is formed by reservation station 22A and functional
unit 24A. Instructions aligned and dispatched to
reservation station 22A are executed by functional unit 24A.
Similarly, issue position 1 is formed by reservation station
22B and functional unit 24B; and issue position 2 is formed
by reservation station 22C and functional unit 24C. As used
herein, the term "issue position" refers to logic circuitry
configured to receive an instruction and to execute that
instruction. Once the instruction enters the issue position, it
remains in that issue position until the execution of the
instruction is completed.

Upon decode of a particular instruction, if a required
operand is a register location, register address information is
routed to reorder buffer 32 and register file 30 simultaneously.
Those of skill in the art will appreciate that the x86
register file includes eight 32 bit real registers (i.e., typically
referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and
ESP). In embodiments of microprocessor 10 which employ
the x86 microprocessor architecture, register file 30 comprises
storage locations for each of the 32 bit real registers.
Additional storage locations may be included within register
file 30 for use by MROM unit 34. Reorder buffer 32 contains
temporary storage locations for results which change the
contents of these registers to thereby allow out of order
execution. A temporary storage location of reorder buffer 32
is reserved for each instruction which, upon decode, is
determined to modify the contents of one of the real registers.
Therefore, at various points during execution of a
particular program, reorder buffer 32 may have one or more
locations which contain the speculatively executed contents
of a given register. If following decode of a given instruction
it is determined that reorder buffer 32 has a previous location
or locations assigned to a register used as an operand in the
given instruction, the reorder buffer 32 forwards to the
corresponding reservation station either 1) the value in the
most recently assigned location, or 2) a tag for the most
recently assigned location if the value has not yet been
produced by the functional unit that will eventually execute
the previous instruction. If reorder buffer 32 has a location
reserved for a given register, the operand value (or reorder
buffer tag) is provided from reorder buffer 32 rather than
from register file 30. If there is no location reserved for a
required register in reorder buffer 32, the value is taken
directly from register file 30. If the operand corresponds to
a memory location, the operand value is provided to the
reservation station through load/store unit 26. 
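
The register operand lookup described above may be sketched in software as follows. This is a minimal illustration under stated assumptions: the names `RobEntry` and `read_register_operand`, and the modeling of the reorder buffer as a list ordered oldest to newest, are hypothetical and chosen only for this example. The sketch returns the value or tag from the most recently assigned reorder buffer location for the register, and falls back to the register file when no location is reserved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    tag: int                      # reorder buffer tag of this location
    dest_reg: str                 # register the instruction will update
    value: Optional[int] = None   # None until the result has been produced


def read_register_operand(reg, rob_entries, register_file):
    """Return ('value', v) or ('tag', t) for a register source operand.

    rob_entries is ordered oldest to newest (an assumed representation).
    """
    for entry in reversed(rob_entries):   # most recently assigned first
        if entry.dest_reg == reg:
            if entry.value is not None:
                return ("value", entry.value)
            return ("tag", entry.tag)
    # No location reserved for this register: read the register file.
    return ("value", register_file[reg])
```

A reservation station receiving a tag would, in this model, wait for a forwarded result carrying the same tag.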

In one particular embodiment, reorder buffer 32 is configured
to store and manipulate concurrently decoded
instructions as a unit. This configuration will be referred to
herein as "line-oriented". By manipulating several instructions
together, the hardware employed within reorder buffer
32 may be simplified. For example, a line-oriented reorder
buffer included in the present embodiment allocates storage
sufficient for instruction information pertaining to three
instructions (one from each decode unit 20) whenever one or
more instructions are dispatched by decode units 20. By
contrast, a variable amount of storage is allocated in conventional
reorder buffers, dependent upon the number of
instructions actually dispatched. A comparatively larger
number of logic gates may be required to allocate the
variable amount of storage. When each of the concurrently
decoded instructions has executed, the instruction results are
stored into register file 30 simultaneously. The storage is
then free for allocation to another set of concurrently
decoded instructions. Additionally, the amount of control
logic circuitry employed per instruction is reduced because
the control logic is amortized over several concurrently
decoded instructions. A reorder buffer tag identifying a
particular instruction may be divided into two fields: a line
tag and an offset tag. The line tag identifies the set of
concurrently decoded instructions including the particular
instruction, and the offset tag identifies which instruction
within the set corresponds to the particular instruction. It is
noted that storing instruction results into register file 30 and
freeing the corresponding storage is referred to as "retiring"
the instructions. It is further noted that any reorder buffer
configuration may be employed in various embodiments of 
microprocessor 10. 
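
The two-field reorder buffer tag may be illustrated with a short Python sketch. The field width `OFFSET_BITS` and the function names are assumptions made for this example; the text above specifies only that a tag comprises a line tag and an offset tag, and that a line holds three instructions in the present embodiment.

```python
OFFSET_BITS = 2   # assumed width: enough to encode offsets 0, 1 and 2
LINE_WIDTH = 3    # instructions per line, one from each decode unit


def make_tag(line_tag, offset_tag):
    """Combine a line tag and an offset tag into one reorder buffer tag."""
    if line_tag < 0 or not 0 <= offset_tag < LINE_WIDTH:
        raise ValueError("invalid line tag or offset tag")
    return (line_tag << OFFSET_BITS) | offset_tag


def split_tag(tag):
    """Recover (line_tag, offset_tag) from a combined reorder buffer tag."""
    return tag >> OFFSET_BITS, tag & ((1 << OFFSET_BITS) - 1)
```

With this encoding, all instructions of one line share the same upper bits, so retiring a line only requires comparing line tags.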

As noted earlier, reservation stations 22 store instructions
until the instructions are executed by the corresponding
functional unit 24. An instruction is selected for execution if:
(i) the operands of the instruction have been provided; and
(ii) the operands have not yet been provided for instructions
which are within the same reservation station 22A-22C and
which are prior to the instruction in program order. It is
noted that when an instruction is executed by one of the
functional units 24, the result of that instruction is passed
directly to any reservation stations 22 that are waiting for
that result at the same time the result is passed to update
reorder buffer 32 (this technique is commonly referred to as
"result forwarding"). An instruction may be selected for
execution and passed to a functional unit 24A-24C during
the clock cycle that the associated result is forwarded.
Reservation stations 22 route the forwarded result to the 
functional unit 24 in this case. 
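
The selection rule for a single reservation station may be sketched as follows. The names `RsEntry` and `select_for_execution` are hypothetical, and the station is modeled as a list of pending instructions in program order (oldest first). Under conditions (i) and (ii) above, the selected instruction is the oldest one whose operands have been provided.

```python
from dataclasses import dataclass


@dataclass
class RsEntry:
    name: str              # label for the pending instruction
    operands_ready: bool   # True once all operands have been provided


def select_for_execution(entries):
    """Pick the instruction to execute from one reservation station.

    entries holds the pending instructions in program order, oldest first.
    Returns the oldest entry whose operands are ready, or None if no
    entry is ready.
    """
    for entry in entries:
        if entry.operands_ready:
            return entry
    return None
```

Result forwarding is not modeled here; in this sketch a forwarded result would simply set `operands_ready` before the selection is made.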

In one embodiment, each of the functional units 24 is
configured to perform integer arithmetic operations of addi- 
tion and subtraction, as well as shifts, rotates, logical 
operations, and branch operations. The operations are per- 
formed in response to the control values decoded for a 
particular instruction by decode units 20. It is noted that a 
floating point unit (not shown) may also be employed to 
accommodate floating point operations. The floating point 
unit may be operated similar to load/store unit 26 in that any 
of decode units 20 may dispatch instructions to the floating
point unit. Additionally, functional units 24 may be config- 
ured to perform address generation for load and store 
memory operations performed by load/store unit 26. 

Each of the functional units 24 also provides information
regarding the execution of conditional branch instructions to 
the branch prediction unit 14. If a branch prediction was 
incorrect, branch prediction unit 14 flushes instructions 
subsequent to the mispredicted branch that have entered the
instruction processing pipeline, and causes fetch of the
required instructions from instruction cache 16 or main
memory. It is noted that in such situations, results of 
instructions in the original program sequence which occur 
after the mispredicted branch instruction are discarded, 
including those which were speculatively executed and 
temporarily stored in load/store unit 26 and reorder buffer 
32. 

Results produced by functional units 24 are sent to reorder 
buffer 32 if a register value is being updated, and to 
load/store unit 26 if the contents of a memory location are
changed. If the result is to be stored in a register, reorder
buffer 32 stores the result in the location reserved for the
value of the register when the instruction was decoded. A
plurality of result buses 38 are included for forwarding of
results from functional units 24 and load/store unit 26. 
Result buses 38 convey the result generated, as well as the 
reorder buffer tag identifying the instruction being executed. 

Load/store unit 26 provides an interface between func- 
tional units 24 and data cache 28. In one embodiment, 
load/store unit 26 is configured with a load/store buffer 
having eight storage locations for data and address infor- 
mation for pending loads or stores. Decode units 20 arbitrate 
for access to the load/store unit 26. When the buffer is full, 
a decode unit must wait until load/store unit 26 has room for
the pending load or store request information. Load/store 
unit 26 also performs dependency checking for load memory 
operations against pending store memory operations to 
ensure that data coherency is maintained. A memory opera- 
tion is a transfer of data between microprocessor 10 and the 
main memory subsystem. Memory operations may be the 
result of an instruction which utilizes an operand stored in
memory, or may be the result of a load/store instruction 
which causes the data transfer but no other operation. 
Additionally, load/store unit 26 may include a special register
storage for special registers such as the segment registers
and other registers related to the address translation
mechanism defined by the x86 microprocessor architecture. 

In one embodiment, load/store unit 26 is configured to 
perform load memory operations speculatively. Store 
memory operations are performed in program order, but may 
be speculatively stored into the predicted way. If the predicted
way is incorrect, the data prior to the store memory
operation is subsequently restored to the predicted way and
the store memory operation is performed to the correct way.
In another embodiment, stores may be executed speculatively
as well. Speculatively executed stores are placed into
a store buffer, along with a copy of the cache line prior to the
update. If the speculatively executed store is later discarded
due to branch misprediction or exception, the cache line may
be restored to the value stored in the buffer. It is noted that
load/store unit 26 may be configured to perform any amount
of speculative execution, including no speculative execution.

Data cache 28 is a high speed cache memory provided to 
temporarily store data being transferred between load/store 
unit 26 and the main memory subsystem. In one 
embodiment, data cache 28 has a capacity of storing up to 
sixteen kilobytes of data in an eight way set associative 
structure. Similar to instruction cache 16, data cache 28 may 
employ a way prediction mechanism. It is understood that
data cache 28 may be implemented in a variety of specific 
memory configurations, including a set associative configu- 
ration. 

In one particular embodiment of microprocessor 10 
employing the x86 microprocessor architecture, instruction
cache 16 and data cache 28 are linearly addressed. The linear
address is formed from the offset specified by the instruction
and the base address specified by the segment portion of the
x86 address translation mechanism. Linear addresses may
optionally be translated to physical addresses for accessing
a main memory. The linear to physical translation is specified
by the paging portion of the x86 address translation
mechanism. It is noted that a linear addressed cache stores 
linear address tags. A set of physical tags (not shown) may 
be employed for mapping the linear addresses to physical 
addresses and for detecting translation aliases. Additionally, 
the physical tag block may perform linear to physical 
address translation. 

Turning now to FIG. 2, a block diagram of one embodiment
of decode units 20B and 20C is shown. Each decode
unit 20 receives an instruction from instruction alignment 
unit 18. Additionally, MROM unit 34 is coupled to each 
decode unit 20 for dispatching fast path instructions corre- 
sponding to a particular MROM instruction. Decode unit 
20B comprises early decode unit 40B, multiplexor 42B, and 
opcode decode unit 44B. Similarly, decode unit 20C 
includes early decode unit 40C, multiplexor 42C, and 
opcode decode unit 44C. 

Certain instructions in the x86 instruction set are both
fairly complicated and frequently used. In one embodiment
of microprocessor 10, such instructions include more complex
operations than the hardware included within a particular
functional unit 24A-24C is configured to perform.
Some of such instructions are classified as a special type of
MROM instruction referred to as a "double dispatch"
instruction. These instructions are dispatched to a pair of
opcode decode units 44 by MROM unit 34. It is noted that
opcode decode units 44 are coupled to respective reservation
stations 22. Each of opcode decode units 44A-44C forms an
issue position with the corresponding reservation station
22A-22C and functional unit 24A-24C. Instructions are
passed from an opcode decode unit 44 to the corresponding
reservation station 22 and further to the corresponding
functional unit 24.

Multiplexor 42B is included for selecting between the
instructions provided by MROM unit 34 and by early
decode unit 40B. During times in which MROM unit 34 is
dispatching instructions, multiplexor 42B selects instructions
provided by MROM unit 34. At other times, multiplexor
42B selects instructions provided by early decode
unit 40B. Similarly, multiplexor 42C selects between
instructions provided by MROM unit 34, early decode unit
40B, and early decode unit 40C. The instruction from
MROM unit 34 is selected during times in which MROM
unit 34 is dispatching instructions. During times in which
early decode unit 40A detects a double dispatch instruction,
the instruction from early decode unit 40B is selected by
multiplexor 42C. Otherwise, the instruction from early
decode unit 40C is selected. Selecting the instruction from
early decode unit 40B into opcode decode unit 44C allows
a fast path instruction decoded by decode unit 20B to be
dispatched concurrently with a double dispatch instruction
decoded by decode unit 20A. In this manner, instruction
alignment unit 18 need not attempt to align MROM instruc- 
tions and concurrently dispatched fast path instructions to 
their final issue positions. Instead, the instructions may be 
aligned to a position and then adjusted between early decode 
units 40 and opcode decode units 44. 
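
The selection performed by multiplexors 42B and 42C may be sketched as plain functions. The function and parameter names are hypothetical. One point is an assumption of this sketch rather than a statement of the text: at multiplexor 42C, the double dispatch case is checked before the general MROM case, on the reasoning that a double dispatch instruction occupies only two issue positions and leaves the third for the fast path instruction from early decode unit 40B. The text does not state the priority between the two conditions explicitly.

```python
def select_mux_42b(mrom_dispatching, mrom_insn, early_40b_insn):
    """Multiplexor 42B: MROM unit output while it dispatches, else 40B."""
    return mrom_insn if mrom_dispatching else early_40b_insn


def select_mux_42c(mrom_dispatching, double_dispatch_in_40a,
                   mrom_insn, early_40b_insn, early_40c_insn):
    """Multiplexor 42C: choose among MROM unit, 40B and 40C outputs.

    Checking double dispatch first is an assumption of this sketch.
    """
    if double_dispatch_in_40a:
        # The fast path instruction from 40B moves into the third position.
        return early_40b_insn
    if mrom_dispatching:
        return mrom_insn
    return early_40c_insn
```

In the double dispatch case this routes the instruction aligned to decode unit 20B into opcode decode unit 44C, which is the adjustment between early decode and opcode decode described above.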

According to one embodiment employing the x86 instruc- 
tion set, early decode units 40 perform the following opera- 
tions: 

(i) merge the prefix bytes of the instruction into an 
encoded prefix byte; 

(ii) decode unconditional branch instructions (which may 
include the unconditional jump, the CALL, and the 
RETURN) which were not detected during branch 
prediction; 

(iii) decode source and destination flags; 

(iv) decode the source and destination operands which are
register operands and generate operand size information;
and

(v) determine the displacement and/or immediate size so
that displacement and immediate data may be routed to
the opcode decode unit.

Opcode decode units 44 are configured to decode the opcode 
of the instruction, producing control values for functional 
unit 24. Displacement and immediate data are routed with 
the control values to reservation stations 22. 

Since early decode units 40 detect operands, the outputs
of multiplexors 42 are routed to register file 30 and reorder
buffer 32. Operand values or tags may thereby be routed to 
reservation stations 22. Additionally, memory operands are 
detected by early decode units 40. Therefore, the outputs of 
multiplexors 42 are routed to load/store unit 26. Memory 
operations corresponding to instructions having memory 
operands are stored by load/store unit 26. 

Turning now to FIG. 3, a diagram depicting instruction 
processing pipeline stages for one embodiment of microprocessor
10 is shown. Other embodiments of microprocessor
10 may employ dissimilar instruction processing pipelines.
The instruction processing pipeline shown in FIG. 3
includes an instruction fetch stage 50, an instruction scan
stage 52, a first alignment stage 54, a second alignment stage 
56, an early decode stage 58, a decode stage 60, an MROM 
entry point stage 62, an MROM access stage 64, and an 
MROM early decode stage 66. MROM entry point stage 62, 
MROM access stage 64, and MROM early decode stage 66 
correspond to MROM unit 34. Instruction fetch stage 50 and 
instruction scan stage 52 are performed by instruction cache 
16. Similarly, first and second alignment stages 54 and 56 
correspond to instruction alignment unit 18, early decode 
stage 58 corresponds to early decode units 40, and decode 
stage 60 corresponds to opcode decode units 44. 

During instruction fetch stage 50, instructions are fetched
from instruction cache 16. The instruction cache storage is
accessed via a fetch address provided by branch prediction
unit 14, and instructions are conveyed to an instruction
scanning unit within microprocessor 10. During instruction
scan stage 52, the instructions are scanned and instruction
blocks are created. The instruction blocks are conveyed to
the byte queue within first alignment stage 54. Additionally,
MROM instructions are detected during instruction scan
stage 52. A detected MROM instruction is routed to MROM
entry point stage 62. In one embodiment, MROM unit 34 is
configured to accept one instruction per clock cycle.
Therefore, if a second MROM instruction is detected within
a set of instruction bytes being scanned during a particular 
clock cycle, instruction blocks including the second MROM 
instruction and subsequent instructions in program order are 
stalled until a subsequent clock cycle. 

During first alignment stage 54, instructions are selected 
from the byte queue included therein for dispatch. In one 
embodiment, up to four instructions are selected for dispatch
from which up to three instructions are actually dispatched, 
as detailed further below. The instructions are conveyed to 
second alignment stage 56, subsequently to early decode 
stage 58 and then to decode stage 60. 

MROM entry point stage 62 is used to determine a 
location within the ROM storage of MROM unit 34 at which 
the first instructions corresponding to a particular MROM 
instruction are stored. The address is passed to MROM 
access stage 64, which accesses the ROM storage and 
receives the instructions stored therein. In one embodiment, 
a line of instructions (i.e. up to the number of instructions 
which may be stored in a reorder buffer line) are received 
during one cycle of ROM storage access. The line of 
instructions is then transmitted to MROM early decode stage 
66, which formats the instructions similar to the formatting 
of early decode units 40 (such that opcode decode units 44 
detect only one type of instruction formatting). The line of 
instructions is then inserted into opcode decode units 44 via 
multiplexors 42. For MROM instructions which employ 
more than a single line of instructions, additional MROM 
accesses are performed in MROM access stage 64 and 
subsequent lines of instructions conveyed to MROM early
decode stage 66 during subsequent clock cycles. First align- 
ment stage 54 and second alignment stage 56 are stalled 
during such subsequent clock cycles. 

It is noted that, since first alignment stage 54 includes a 
byte queue storing multiple instructions and the instructions 
are selected therefrom in program order, a particular MROM 
instruction may arrive in MROM access stage 64 prior to 
being selected for dispatch from the byte queue. The par- 
ticular MROM instruction may be subsequent to a large 
number of instructions within the byte queue, and instruc- 
tions are selected for dispatch in program order. (MROM 
instructions are routed to MROM unit 34 but are not
removed from the instruction blocks conveyed to instruction
alignment unit 18.) Alternatively, the particular MROM
instruction may be queued in MROM unit 34 while a prior
MROM instruction executes. The particular MROM instruction
may be selected for dispatch prior to arriving at MROM
access stage 64. Therefore, synchronization is provided 
between second alignment stage 56 and MROM access stage 
64 (illustrated by synchronization bus 67). 

When MROM access stage 64 receives an entry point
address from MROM entry point stage 62, MROM access
stage 64 informs second alignment stage 56 by asserting a signal
upon synchronization bus 67. When second alignment stage
56 receives a dispatched MROM instruction from first
alignment stage 54, second alignment stage 56 signals
MROM access stage 64 via synchronization bus 67. In this
manner, the MROM instruction progresses to both MROM
early decode stage 66 and early decode stage 58 during the
same clock cycle. Because both second alignment stage 56
and MROM access stage 64 receive instructions in program
order, it is sufficient to synchronize instructions via synchronization
bus 67.

During MROM entry point stage 62, MROM unit 34 
determines if a particular MROM instruction is double 
dispatch. A particular MROM instruction is double dispatch 
if the particular MROM instruction corresponds to a single 
line of instructions within which two instructions are stored. 
If MROM unit 34 detects a double dispatch instruction, a 
double dispatch signal upon a double dispatch conductor 68 
is asserted. Otherwise, the double dispatch signal is deasserted.
The double dispatch signal is conveyed to both first
alignment stage 54 and second alignment stage 56. Second
alignment stage 56 uses the state of the double dispatch
signal to determine if instructions dispatched during the
previous clock cycle (in first alignment stage 54 and therefore
currently residing in second alignment stage 56) should
be discarded. More particularly, second alignment stage 56
discards the second instruction in program order if: (i) an
MROM instruction and a fast path instruction were concurrently
dispatched; and (ii) the double dispatch signal is
deasserted. Otherwise, second alignment stage 56 passes the
instructions to early decode stage 58. It is noted that, 
although MROM entry point stage 62 detects the double 
dispatch nature of an MROM instruction, the double dis- 
patch signal as shown in FIG. 3 is asserted from the MROM 
access stage 64. Alternatively, the double dispatch signal 
may be asserted from MROM entry point stage 62 and 
instruction alignment unit 18 may store the signal value for 
use in the subsequent clock cycle. 
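
The discard decision made in second alignment stage 56 may be sketched as follows. The function name `second_alignment_stage` and the modeling of the dispatched instructions as a list in program order are assumptions made for this example. When packing was performed and the double dispatch signal is deasserted, the second instruction in program order is dropped; otherwise all instructions pass to early decode stage 58.

```python
def second_alignment_stage(instructions, packed, double_dispatch):
    """Return the instructions passed on to the early decode stage.

    instructions:    instructions dispatched in the previous clock cycle,
                     in program order.
    packed:          True if an MROM instruction and a fast path
                     instruction were concurrently dispatched.
    double_dispatch: state of the double dispatch signal.
    """
    if packed and not double_dispatch:
        # Packing was unsuccessful: discard the second instruction.
        return instructions[:1] + instructions[2:]
    return list(instructions)
```

The discarded instruction is not lost in this model, since the second of the packed instructions is retained in the byte queue and can be redispatched.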

First alignment stage 54 uses the double dispatch signal as 
well. When first alignment stage 54 concurrently dispatches 
an MROM instruction and a fast path instruction (referred to 
herein as "packing"), the second of the two instructions in 
program order is retained in the byte queue. During each 
clock cycle, first alignment stage 54 initially selects up to 
four instructions for dispatch during a particular clock cycle. 
If first alignment stage 54 packed during the previous clock 
cycle and the double dispatch signal is asserted, then the first 
of the four instructions (in program order) is ignored and the 
remainder are dispatched. Conversely, if first alignment
stage 54 did not pack during the previous clock cycle or the 
double dispatch signal is deasserted, the first three of the 
four instructions (in program order) are dispatched and the
fourth is retained by the byte queue. In this manner, redis- 
patch of the second of the packed instructions is performed 
when needed without sacrificing other dispatch positions. 
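
The two-phase selection in first alignment stage 54 may be sketched as follows. The function name `select_for_dispatch`, its parameters, and the modeling of the byte queue as a list of instructions in program order are hypothetical. Up to four instructions are first chosen as candidates; the outcome of packing in the previous clock cycle then determines which three are actually dispatched.

```python
def select_for_dispatch(byte_queue, packed_last_cycle, double_dispatch,
                        candidates=4, issue_positions=3):
    """Return the instructions actually dispatched this clock cycle.

    byte_queue lists the queued instructions in program order; its first
    entry is the retained instruction if packing occurred last cycle.
    """
    # Phase 1: potentially dispatchable instructions.
    potential = byte_queue[:candidates]

    # Phase 2: actual selection.
    if packed_last_cycle and double_dispatch:
        # Packing succeeded, so the retained instruction was already
        # dispatched: ignore it and dispatch the remainder.
        return potential[1:]
    # No packing, or unsuccessful packing: dispatch the foremost three
    # (redispatching the retained instruction if there is one).
    return potential[:issue_positions]
```

Selecting four candidates up front is what allows three instructions to be dispatched in either case, so a successful pack does not cost a dispatch position in the following cycle.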

Turning now to FIG. 4, a block diagram of one embodiment
of instruction cache 16 and instruction alignment unit
18 is shown. Instruction cache 16 includes an instruction
cache storage and control block 70 and an instruction 
scanning unit 72. Instruction alignment unit 18 includes a 
byte queue 74, a selection control unit 76, and a multiplex 
to issue block 78. 

Instruction cache storage and control block 70 includes 
storage for instruction cache lines and related control circuitry
for fetching instructions from the storage, for selecting
cache lines to discard when a cache miss is detected, etc.
Instruction cache storage and control block 70 receives fetch 
addresses from branch prediction unit 14 (not shown) in 
order to fetch instructions for execution by microprocessor 
10. Instruction bytes fetched from instruction cache storage 
and control block 70 are conveyed to instruction scanning 
unit 72 upon an instructions bus 80. Instruction bytes are 
conveyed upon instructions bus 80, as well as corresponding 
predecode data (e.g. start, end, and functional bits). In one 
embodiment, sixteen bytes stored in contiguous memory
locations are conveyed upon instructions bus 80 along with
the corresponding predecode data. The sixteen bytes form
either the upper or lower half of the 32 byte cache line. The
upper half of the cache line is the half stored in memory 
addresses having larger numerical values, while the lower 
half is stored in memory addresses having smaller numerical 
values. Additionally, instruction scanning unit 72 receives 
information regarding the bytes within the sixteen bytes 
which are to be conveyed as instructions to instruction
alignment unit 18. Instruction bytes at the beginning of the
sixteen bytes may be ignored if the bytes are fetched as the 
target of a branch instruction, and the target address iden- 
tifies a byte other than the first byte of the sixteen bytes. 
Additionally, if a branch instruction is within the sixteen 
bytes and branch prediction unit 14 predicts the branch
taken, then bytes subsequent to the branch instruction within 
the sixteen bytes are ignored. 

Instruction scanning unit 72 scans the predecode data 
associated with the bytes which are to be conveyed as 
instructions to instruction alignment unit 18. In the present 
embodiment, instruction scanning unit 72 divides the sixteen 
bytes conveyed by instruction cache storage and control 
block 70 into two portions comprising eight contiguous 
bytes each. One portion forms the lower half of the sixteen 
bytes (i.e. the bytes stored at smaller numerical addresses 
than the bytes forming the upper half of the sixteen bytes). 
The other portion forms the upper half of the sixteen bytes. 
Therefore, an eight byte portion forms one of four quarters 
of the 32 byte cache line employed by instruction cache 
storage and control block 70, according to the present 
embodiment. As used herein, bytes are contiguous if they are
stored in contiguous memory locations in the main memory 
subsystem. It is noted that particular sizes of various 
components, such as instruction block sizes, are used herein 
for clarity of the description. Any size may be used for each 
component within the spirit and scope of the appended 
claims. 

Instruction scanning unit 72 scans the predecode data of 
each portion of the instructions independently and in paral- 
lel. Instruction scanning unit 72 identifies up to a predefined 
maximum number of instructions within each portion from 
the start and end byte information included within the
predecode data. For the present embodiment, the predefined 
maximum number is three. Generally speaking, instruction 
scanning unit 72 preferably identifies a maximum number of 
instructions in each portion equal to the number of issue 
positions included within microprocessor 10. 
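
The scan of one eight byte portion can be illustrated with a minimal sketch. It assumes the predecode data is available as one start bit and one end bit per byte; the function name, the list representation, and the returned flag are hypothetical and are not taken from the described circuitry. Instructions that begin in one portion and end in another are not modeled here.

```python
from typing import List, Tuple

MAX_INSTRUCTIONS = 3  # predefined maximum per portion in the present embodiment


def scan_portion(start_bits: List[int], end_bits: List[int],
                 max_instructions: int = MAX_INSTRUCTIONS
                 ) -> Tuple[List[Tuple[int, int]], bool]:
    """Identify instructions in one portion from start/end predecode bits.

    Returns the (start byte, end byte) pairs found, in program order, and a
    flag that is True when more instructions than the maximum are present.
    """
    found: List[Tuple[int, int]] = []
    start = None
    for i, (s, e) in enumerate(zip(start_bits, end_bits)):
        if s:
            start = i  # a new instruction begins at this byte
        if e and start is not None:
            if len(found) == max_instructions:
                return found, True  # limit reached; instructions remain
            found.append((start, i))
            start = None
    return found, False


# Four instructions in eight bytes: only the first three are identified.
pairs, more = scan_portion([1, 0, 1, 1, 0, 0, 1, 0],
                           [0, 1, 1, 0, 0, 1, 0, 1])
```

In this example `pairs` holds three (start, end) pairs and `more` is true, since a fourth instruction occupies the last two bytes.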

The instruction bytes and instruction identification infor- 
mation generated by instruction scanning unit 72 are con- 
veyed to byte queue 74 upon an instructions bus 82 and an 
instruction data bus 84, respectively. The instruction bytes 
are conveyed as eight byte portions, and the instruction data 
is arranged accordingly such that each eight byte portion is 
associated with a portion of the instruction identification 
information conveyed upon instruction data bus 84. Each 
eight byte portion and the corresponding instruction identi- 
fication information forms an instruction block. It is noted 
that, although an instruction block includes eight bytes in the 
present embodiment, instruction blocks may include any 
number of bytes in various embodiments. Byte queue 74 
receives the instruction blocks conveyed and stores them 
into one of multiple subqueues included therein. In the 
embodiment shown, byte queue 74 includes three sub- 
queues: a first subqueue 86A, a second subqueue 86B, and 
a third subqueue 86C. First subqueue 86A stores the instruction
block which is foremost among the instruction blocks
stored in byte queue 74 in program order. Second subqueue
86B stores the instruction block which is second in program
order, and third subqueue 86C stores the instruction block which
is third in program order. It is noted that various embodiments
of byte queue 74 may include any number of
subqueues 86.

If a particular portion as scanned by instruction scanning
unit 72 includes more than the maximum predefined number
of instructions, then the particular portion is retained by
instruction scanning unit 72. During the following clock
cycle, the particular eight byte portion is scanned again. The
predecode data corresponding to the previously identified
instructions is invalidated such that instruction scanning unit
72 detects the additional instructions. If the other portion
concurrently received with the particular portion is subsequent
to the particular portion in program order, then the
other portion is rescanned as well. Byte queue 74 discards
the instruction blocks received from the other portion, in
order to retain program order among the instruction blocks
stored in the byte queue.

A control unit 90 within byte queue 74 conveys a byte
queue status upon byte queue status bus 88 to instruction
scanning unit 72. Byte queue status bus 88 includes a signal
corresponding to each subqueue 86. The signal is asserted if
the subqueue 86 is storing an instruction block, and deasserted
if the subqueue 86 is not storing an instruction block.
In this manner, instruction scanning unit 72 may determine
how many instruction blocks are accepted by byte queue 74
during a clock cycle. If two instruction blocks are conveyed
during a clock cycle and only one instruction block is
accepted, instruction scanning unit 72 retains the rejected
instruction block and rescans the instruction block in the
subsequent clock cycle.

As noted above, an instruction block may contain up to a
predefined maximum number of instructions (e.g. three in
the present embodiment). Additionally, eight contiguous
bytes are conveyed for each instruction block in the present
embodiment. However, due to the variable byte length of the
x86 instructions, an instruction may begin within one set of
contiguous bytes and end in another set of contiguous bytes.
Such instructions are referred to as overflow instructions. If
an overflow instruction is detected, it is identified as the last
of the predefined number of instructions. Instead of being
indicated as a valid instruction within the instruction block,
the overflow instruction is identified as an overflow.
Instruction identification information is generated, but the
instruction is handled somewhat differently, as will be
explained in more detail below.

In one embodiment, the instruction identification information
for each instruction includes: (i) start and end
pointers identifying the bytes at which the identified instruction
begins and ends within the eight bytes; (ii) a valid mask
containing eight bits, one for each of the eight bytes; (iii) a
bit indicative of whether the instruction is MROM or fast
path; and (iv) an instruction valid bit indicating that the
instruction is valid and an overflow bit for the last instruction
indicating that it is an overflow. The valid mask includes a
binary one bit corresponding to each byte included within
the particular instruction (i.e. the bits between the start
pointer and end pointer, inclusive, are set). Zero bits are
included for the other bytes. Additional information conveyed
with the instruction identification information is the
taken/not taken prediction if the instruction is a branch
instruction, bits indicating which of the quarters of the 32
byte cache line the eight bytes correspond to, the functional
bits from the predecode data corresponding to the eight
bytes, and a segment limit identifying the segment limit
within the eight bytes for exception handling. The additional
information is provided by instruction cache storage and
control block 70 except for the branch prediction, which is
provided by branch prediction unit 14.
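
As an illustration of the valid mask described above, the following sketch derives an eight bit mask from a start pointer and an end pointer. It assumes that bit i of the mask corresponds to byte i of the eight byte portion; that bit ordering and the function name are hypothetical choices made for the example only.

```python
def valid_mask(start: int, end: int, width: int = 8) -> int:
    """Return a mask with ones for bytes start..end (inclusive), zeros elsewhere.

    Assumption: bit i of the result corresponds to byte i of the portion.
    """
    if not 0 <= start <= end < width:
        raise ValueError("pointers must satisfy 0 <= start <= end < width")
    mask = 0
    for byte in range(start, end + 1):
        mask |= 1 << byte  # set the bit for each byte of the instruction
    return mask


# An instruction occupying bytes 2 through 4 of the eight byte portion.
example = format(valid_mask(2, 4), "08b")
```

Here `example` is the string "00011100": three one bits for bytes 2, 3, and 4, and zero bits for the other five bytes.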

Selection control unit 76 examines the instruction identification
information stored in each subqueue to generate
selection controls for multiplex to issue block 78. Multiplex
to issue block 78 includes a plurality of multiplexors for
selecting instruction bytes from byte queue 74 for conveyance
to each of decode units 20. Byte queue 74 maintains
certain properties with respect to each subqueue 86 in order
to simplify the selection logic within selection control unit
76, as will be explained in more detail below. Instructions
are selected and conveyed, and corresponding instruction
identification information is invalidated such that subsequent
instructions may be dispatched in subsequent clock
cycles.

Subqueues 86 store instruction information in a plurality
of instruction positions (or simply "positions"). The number
of instruction positions is preferably equal to the maximum
number of instructions which may be included in an instruction
block. For the present embodiment, three positions are
included. The first position ("position I0") stores the instruction
identification information corresponding to the instruction
which is foremost in program order within the instruction
block stored in the subqueue 86. The second position
("position I1") stores the instruction identification information
corresponding to the second instruction in program
order within the instruction block. Finally, the third position
("position I2") stores the instruction identification information
corresponding to the last instruction in program order.
Alternatively, position I2 may store instruction identification
information corresponding to an overflow instruction. Certain
instruction identification information is the same for
each instruction (e.g. the segment limit). To avoid duplicating
information, this instruction information may be stored
as a single copy separate from the instruction positions.

Control unit 90 maintains the information stored in each
subqueue 86. In particular, control unit 90 directs each
subqueue 86 to shift instruction identification information
between the positions when instructions are selected for
dispatch. For example, if the instruction corresponding to
position I0 is dispatched, the information stored in position
I1 is shifted into position I0 and the information stored in
position I2 is shifted into position I1. Similarly, if the
instructions corresponding to positions I0 and I1 are
dispatched, then information stored in position I2 is shifted
into position I0. In this manner, the instruction within the
subqueue which is foremost in program order is maintained
in position I0, the instruction which is second in program
order is maintained in position I1, etc. In order to select an
instruction for dispatch to decode unit 20A, selection control
unit 76 examines the instruction identification information
stored in position I0 of each subqueue. Advantageously, a
small amount of logic may be employed to select the
instruction. Similarly, position I0 of subqueue 86A and
position I2 of each subqueue 86A-86C are not examined to
select an instruction for decode unit 20B. The second
instruction to be dispatched will be found within the first two
positions of one of the subqueues 86 when maintained in
accordance with the above. Selection control unit 76 informs
control unit 90 of which instruction positions were selected
for dispatch during a clock cycle, such that subqueue shifting
may be performed.
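
The shifting behavior and the resulting selection for decode unit 20A may be sketched as follows. This is a minimal software model under stated assumptions, not the described hardware: each subqueue is represented as a three element list ordered from the first position to the third, `None` marks a position holding no valid information, and both function names are hypothetical. The special handling of overflow instructions is not modeled.

```python
from typing import List, Optional, Tuple

Positions = List[Optional[str]]  # three positions, first to third


def shift_positions(positions: Positions, dispatched: int) -> Positions:
    """Shift remaining entries toward the first position after a dispatch."""
    if not 0 <= dispatched <= len(positions):
        raise ValueError("cannot dispatch more entries than positions")
    remaining = positions[dispatched:]
    return remaining + [None] * dispatched  # vacated positions become invalid


def first_instruction(subqueues: List[Positions]) -> Optional[Tuple[int, str]]:
    """Pick the instruction for decode unit 20A.

    Only the first position of each subqueue is examined, because shifting
    keeps the foremost instruction of every subqueue in that position.
    """
    for index, positions in enumerate(subqueues):
        if positions[0] is not None:
            return index, positions[0]
    return None


after_one = shift_positions(["A", "B", "C"], 1)
after_two = shift_positions(["A", "B", "C"], 2)
choice = first_instruction([[None, None, None], ["D", "E", None], ["F", None, None]])
```

After one dispatch the subqueue holds B and C in its first two positions; after two dispatches only C remains, in the first position. With an empty first subqueue, `choice` identifies instruction D in the second subqueue.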

According to one embodiment, instruction identification
information is shifted internally to each subqueue 86 independently.
Instruction identification information is not,
therefore, shifted from position I0 of subqueue 86B into
positions within subqueue 86A. Instead, when each of the
instructions within subqueue 86A have been dispatched,
subqueue 86B is shifted into subqueue 86A as a whole. The
logic for shifting between subqueues 86 may operate independently
from and in parallel with the internal shifting of
each subqueue 86A-86C.

Position I2 may store instruction identification information
regarding an overflow instruction. If position I2 is
storing information regarding an overflow instruction, then
the information is not shifted to position I0 or I1 as described
above. In this manner, overflow instruction information is
always available in position I2. Selection control unit 76
may examine the information stored in position I2 for
routing bytes corresponding to an overflow instruction, as
opposed to having to locate the overflow information within
the positions and then determining byte routing.

Selection control unit 76 selects instructions from the
instruction positions within subqueues 86 for potential
dispatch. The instructions selected are the instructions which
are foremost in program order among the instructions stored
in subqueues 86. More instructions are initially selected for
dispatch than the number of issue positions included in
microprocessor 10, in order to correctly perform redispatch
of instructions when an MROM instruction and a fast path
instruction are concurrently dispatched and the MROM
instruction is found to be an arbitrary dispatch instruction.
Selection control unit 76 then selects from the potentially
dispatchable instructions based upon the value of a packed
state stored in a packed state register 92 coupled to selection
control unit 76 and the state of the double dispatch signal
upon double dispatch conductor 68, also coupled to selection
control unit 76.

When selection control unit 76 selects an MROM instruction
and a fast path instruction for concurrent dispatch
during a clock cycle, selection control unit 76 sets the
packed state. Otherwise, the packed state is reset. The
packed state so generated is stored into packed state register
92 for use during the succeeding clock cycle. Additionally,
selection control unit 76 informs control unit 90 that the first
of the MROM instruction and the fast path instruction (in
program order) is being dispatched. In this manner, byte
queue 74 retains the second of the two instructions in
program order, despite the dispatch of the second of the two
instructions. In one embodiment, the packed state comprises
a bit indicative, when set, that an MROM instruction and a
fast path instruction were concurrently dispatched in the
previous clock cycle.

From the potentially dispatchable instructions, selection
control unit 76 selects instructions for dispatch based upon
the packed state stored in packed state register 92 and the
double dispatch signal. If the packed state is set, an MROM
instruction and a fast path instruction were concurrently
dispatched in the previous clock cycle. Therefore, the
instruction within the potentially dispatchable instructions
which is foremost in program order is one of the two
instructions previously dispatched when the packed state is
set. If the packed state is set and the double dispatch signal
is asserted, the concurrent dispatch of the MROM instruction
and the fast path instruction is successful. If the packed
state is set and the double dispatch signal is deasserted, the
concurrent dispatch of the MROM instruction and the fast
path instruction is unsuccessful. The MROM instruction
occupies at least three issue positions, and therefore the fast
path instruction cannot be concurrently dispatched for the
embodiment of microprocessor 10 shown in FIG. 1. If the
packed state is clear, concurrent dispatch of an MROM and
fast path instructions was not performed in the previous
clock cycle. Therefore, the instructions within the potentially
dispatchable instructions were not previously dispatched.

According to one embodiment, selection control unit 76
selects the foremost instructions in program order from the
set of potentially dispatchable instructions if either the
packed state is clear or the packed state is set and the double
dispatch signal is deasserted. In the case of the packed state
being clear, the foremost set of instructions are dispatched
and program order is maintained. In the case of the packed
state being set and the double dispatch signal being
deasserted, the second of the instructions dispatched during
the previous clock cycle is redispatched. If the second of the
instructions is the MROM instruction, it is dispatched alone.
If the second of the instructions is the fast path instruction,
additional instructions may be selected for concurrent dispatch.
Advantageously, the largest number of concurrently
dispatchable instructions is selected, even in the case of
redispatching a previously dispatched instruction.

If the packed state is set and the double dispatch signal is
asserted, then the instruction within the potentially dispatchable
instructions which is foremost in program order is
the second of the previously dispatched instructions and that
instruction was successfully dispatched during the previous
clock cycle (i.e. the MROM instruction and fast path
instruction, when taken together, occupy a number of issue
positions less than or equal to the number of issue positions
included within microprocessor 10). This instruction is
therefore not selected during the current clock cycle.
Instead, instructions are dispatched from the remaining of
the potentially dispatchable instructions.

Upon selection of the instructions dispatched, the packed
state is determined for the subsequent clock cycle. In
addition, control unit 90 is informed of the instructions
dispatched. For the case of the packed state being set and the
double dispatch signal being asserted, the instruction which
was previously dispatched is indicated as dispatched as well
as each of the instructions dispatched during the present
clock cycle. Subqueues 86 are shifted accordingly. In one
embodiment, control unit 90 is informed of the subqueue
and position storing the last instruction (in program order) to
be dispatched. Selection control unit 76 identifies the last
instruction in accordance with the above functionality. Byte
queue 74 shifts out the instructions prior to and including the
indicated last instruction. In this manner, byte queue 74
operates independent of the logic used to concurrently
dispatch MROM and fast path instructions. For example,
when packing an MROM instruction and a fast path
instruction, the first of the instructions in program order is
marked as the last instruction. The second of the instructions
is thereby retained in byte queue 74 while the first of the
instructions is shifted out.

It is noted that, in one embodiment, the circuitry shown in
FIG. 4 for instruction alignment unit 18 forms first alignment
stage 54. Second alignment stage 56 is not shown in
FIG. 4. It is further noted that additional details regarding the
operation of byte queue 74 may be found in the commonly
assigned, co-pending patent application entitled: "A Byte
Queue Divided into Multiple Subqueues for Optimizing
Instruction Selection Logic", filed concurrently herewith by
Narayan, et al. The disclosure of the referenced patent
application is incorporated herein by reference in its
entirety.

Turning next to FIG. 5, a flowchart 100 depicting the
operation of selection control unit 76 is shown according to
one exemplary embodiment. During step 102, selection
control unit 76 selects a set of potentially dispatchable
instructions. For the present embodiment, up to four instructions
(e.g. instructions A, B, C, and D, in program order)
may be selected. Preferably, the maximum number of
instructions selected into the set of potentially dispatchable
instructions is the number of issue positions included in
microprocessor 10 plus the number of instructions which
may be redispatched due to an unsuccessful concurrent
dispatch of MROM instructions and fast path instructions
during a previous clock cycle. Therefore, the maximum
number of instructions within the potentially dispatchable
instructions may vary from embodiment to embodiment.

The selection of instructions according to step 102 may
involve certain restrictions. For example, the present
embodiment may concurrently dispatch up to three fast path
instructions (one for each issue position), a fast path instruction
and an MROM instruction, or an MROM instruction
alone. Therefore, instructions C and D may not be MROM
instructions in the present embodiment. If the instruction
which would otherwise be instruction C is an MROM
instruction, no instructions are selected as instruction C or
D. Furthermore, microprocessor 10 allows up to one predicted
taken branch instruction to be concurrently
dispatched, according to one embodiment. If a second
predicted taken branch instruction is encountered, selection
control unit 76 does not select that branch instruction or any
subsequent instructions for potential dispatch. According to
another embodiment, instructions from at most two cache
lines may be concurrently dispatched. If an instruction from
a third cache line is encountered, it is not selected for
concurrent dispatch. These restrictions may not be applied in
other embodiments. Additional or supplemental restrictions
may be applied in other embodiments as well.

Decision box 104 determines which of the set of potentially
dispatchable instructions are selected for dispatch,
based upon the packed state and the double dispatch signal.
If the packed state is set and the double dispatch signal is
asserted, instructions B, C, and D are selected for dispatch
(step 106). In this case, instruction A is one of the previously
dispatched instructions. Since the concurrent dispatch of the
MROM instruction and the fast path instruction is
successful, instruction A need not be redispatched during the
present clock cycle. If the packed state is clear or the double
dispatch signal is deasserted, then instructions A, B, and C
are selected for dispatch (step 108).

Following selection of instructions for dispatch, selection
control unit 76 determines the packed state for the subsequent
clock cycle (decision block 110). For the present
embodiment, if the selected instructions include an MROM
instruction and a fast path instruction, then the packed state
is set (step 112). The packed state is clear if the selected
instructions do not include an MROM instruction and a fast
path instruction (step 114).

Generally speaking, a selection method similar to flowchart
100 may be used to speculatively dispatch a set of
instructions concurrently. The dispatch is speculative in the
sense that the dispatched set of instructions may subsequently
be determined to occupy a number of issue positions
greater than the number of issue positions included within
the microprocessor. Upon such speculative dispatch, the
packed state may be set and a number of instructions equal
to the maximum number of instructions which may be
redispatched are retained in the byte queue. The retained
instructions as well as a set of additional instructions may
then be preliminarily selected as potentially dispatchable
instructions in the succeeding clock cycle, and appropriate
selection of instructions from the potentially dispatchable
instructions may be performed based upon the result of the
speculative dispatch.
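
The decisions of flowchart 100 (decision box 104 through step 114) can be sketched in software as follows. This is an illustrative model only. It assumes that the set of potentially dispatchable instructions produced by step 102 already honors the dispatch restrictions, it represents each instruction as a hypothetical (name, is_mrom) pair, and it does not model the case in which a redispatched MROM instruction is dispatched alone.

```python
from typing import List, Tuple

Instruction = Tuple[str, bool]  # (name, is_mrom); hypothetical representation


def select_for_dispatch(candidates: List[Instruction], packed: bool,
                        double_dispatch: bool, issue_positions: int = 3
                        ) -> Tuple[List[Instruction], bool]:
    """Return the instructions selected this cycle and the next packed state.

    `candidates` is the potentially dispatchable set, in program order.
    """
    if packed and double_dispatch:
        # Step 106: packing succeeded, so the foremost (retained)
        # instruction was already dispatched and is skipped.
        selected = candidates[1:1 + issue_positions]
    else:
        # Step 108: no packing, or packing failed; take the foremost
        # instructions, which redispatches any retained instruction.
        selected = candidates[:issue_positions]
    # Decision block 110: set the packed state only when an MROM and a
    # fast path instruction are selected together (steps 112 and 114).
    has_mrom = any(is_mrom for _, is_mrom in selected)
    has_fast_path = any(not is_mrom for _, is_mrom in selected)
    return selected, has_mrom and has_fast_path


# Cycle 1: an MROM instruction followed by a fast path instruction is packed.
first, packed_state = select_for_dispatch([("M", True), ("F1", False)], False, False)

# Cycle 2: F1 was retained and is foremost among the candidates.
nxt = [("F1", False), ("F2", False), ("F3", False), ("F4", False)]
on_success, _ = select_for_dispatch(nxt, packed_state, True)
on_failure, _ = select_for_dispatch(nxt, packed_state, False)
```

In the second cycle, a successful packing yields F2, F3, and F4, while an unsuccessful packing yields F1, F2, and F3, so that the retained fast path instruction is redispatched together with additional instructions.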

Turning next to FIG. 6, a logic diagram of one embodiment
of multiplex to issue block 78 is shown. Multiplex to
issue block 78 includes a plurality of first multiplexors
120A-120D and a plurality of second multiplexors
122A-122C. First multiplexors 120 receive a set of selection
controls upon a first selection controls bus 124 from selection
control unit 76. Each first multiplexor 120 receives a
separate set of selection controls, in one embodiment.
Similarly, a selection control for multiplexors 122 is
received upon second selection controls bus 126. In one
embodiment, one selection control is included upon selection
controls bus 126. The selection control is shared by
second multiplexors 122. Second multiplexor 122A is
coupled to first multiplexors 120A and 120B. Similarly, each
second multiplexor 122 is coupled to a pair of first multiplexors
120 as shown in FIG. 6. Additionally, each of second
multiplexors 122 selects an instruction for conveyance to a
corresponding decode unit 20. However, the instruction
selected may pass through additional instruction processing
pipeline stages prior to arrival in the corresponding decode
unit. For example, one embodiment of microprocessor 10
employs the instruction processing pipeline shown in FIG. 3.
Instructions pass from second multiplexors 122 through
second alignment stage 56 prior to arrival in decode units 20
in that embodiment.
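
The two-level arrangement may be modeled with a short sketch. It assumes that the outputs of first multiplexors 120A-120D are available as a four element list in program order and that the single shared selection control is a boolean; the function name, the parameter names, and the pairing of each second multiplexor with two adjacent first multiplexors are assumptions made for illustration.

```python
from typing import List, Optional


def second_stage(first_outputs: List[Optional[str]],
                 skip_foremost: bool) -> List[Optional[str]]:
    """Model second multiplexors 122A-122C driven by one shared control.

    Each second multiplexor is assumed to choose between two adjacent first
    multiplexor outputs; the shared control selects the same side for all.
    """
    if len(first_outputs) != 4:
        raise ValueError("expected the outputs of four first multiplexors")
    offset = 1 if skip_foremost else 0
    return [first_outputs[i + offset] for i in range(3)]


in_order = second_stage(["A", "B", "C", "D"], skip_foremost=False)
skipping = second_stage(["A", "B", "C", "D"], skip_foremost=True)
```

With the control deasserted the three issue positions receive A, B, and C; with the control asserted they receive B, C, and D. Sharing one control keeps the second stage to a single two-way choice per issue position.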

Selection control unit 76 generates selection controls for
each first multiplexor 120 by scanning the position information
stored in byte queue 74. First multiplexor 120A
produces the instruction which is foremost in program order
within byte queue 74. Similarly, multiplexors 120B, 120C,
and 120D produce the second, third, and fourth instructions
in program order from the instructions stored in byte queue
74, respectively.

Second multiplexors 122 are provided for selecting
instructions from the potentially dispatchable instructions
identified by first multiplexors 120. The selection control
upon second selection control bus 126 is toggled in conformance
with the flowchart shown in FIG. 5. In other
words, the selection control is toggled to cause second
multiplexor 122A to select the output of first multiplexor
120A, second multiplexor 122B to select the output of first
multiplexor 120B, and second multiplexor 122C to select
the output of first multiplexor 120C if step 108 is performed.
Alternatively, when step 106 is performed the selection
control is toggled to cause second multiplexor 122A to select
the output of first multiplexor 120B, second multiplexor
122B to select the output of first multiplexor 120C, and
second multiplexor 122C to select the output of first multiplexor
120D.

Turning next to FIG. 7, a table 130 is shown identifying
the valid combinations of instructions which may be stored
within a particular subqueue according to the present
embodiment. Other embodiments may employ similar or
dissimilar combinations of instructions. Each row of table
130 is a valid combination of instructions, listed by position
(I0-I2 as shown across the top of table 130). The symbol
"X" in a position indicates an invalid instruction (i.e. no
instruction identification information is stored therein). The
symbols "A", "B", and "C" indicate that valid instruction
identification information is stored in that position. Symbol
"A" identifies an instruction prior to instructions identified
by symbols "B" and "C" in program order. Similarly,
symbol "B" identifies an instruction prior to the instruction
identified by symbol "C" in program order. The symbol "O"
in position I2 indicates that instruction identification information
is stored therein with respect to an overflow instruction.
The symbol "O" represents an instruction subsequent to
the instructions identified by symbols "A" and "B" in
program order.

Table 130 illustrates that instruction identification information
is stored in position I1 only when instruction identification
information is stored in position I0. Similarly,
instruction identification information is stored in position I2
only when instruction identification information is stored in
both positions I0 and I1, except when position I2 stores an
overflow instruction. When position I2 stores an overflow
instruction, position I1 still stores instruction identification
information only if position I0 stores instruction identification
information. However, position I2 stores the overflow
instruction information independent of the status of positions
I0 and I1. Advantageously, the instruction which is
foremost in program order is stored in position I0, even after
instructions from the instruction block have been dispatched.
Additionally, overflow instructions remain stored in
position I2, even after instructions from the instruction block
have been dispatched.

Turning next to FIG. 8, a table 132 is shown depicting the
positions analyzed by selection control unit 76 for selecting
instructions via each first multiplexor 120 in accordance
with one embodiment. Each row of table 132 corresponds to
a particular first multiplexor 120A-120D, as identified by
the first column of table 132. For first multiplexor 120A,
selection control unit 76 analyzes the instruction identification
information stored in position I0 of each subqueue 86.
Additionally, the overflow bit is examined for subqueues
86A and 86B. If position I0 of first subqueue 86A is valid,
then that instruction is selected via multiplexor 120A. If
position I0 of second subqueue 86B is valid and position I0
of first subqueue 86A is invalid, then the instruction corresponding
to position I0 of second subqueue 86B is selected.
Finally, the instruction corresponding to position I0 of third
subqueue 86C is selected if position I0 of subqueues 86A
and 86B are invalid. Because instructions may be up to 15
bytes long, an instruction may begin in first subqueue 86A,
overflow into second subqueue 86B, and further overflow
into third subqueue 86C. Such a case is an example where
position I0 of third subqueue 86C is selected. In addition, the
overflow indications from first and second subqueues 86A
and 86B are considered in creating multiplexor selection
controls for first multiplexor 120A. If position I0 of second
subqueue 86B is selected and the overflow indication of first
subqueue 86A indicates overflow, then instruction bytes
from first subqueue 86A form a portion of the instruction
indicated by position I0 of second subqueue 86B. The start
pointer and valid mask corresponding to position I2 of
subqueue 86A are used to multiplex the instruction bytes
with instruction bytes indicated by the end pointer and valid
mask corresponding to instruction I0 of subqueue 86B (or
subqueue 86C, if the instruction overflows thereto).

For first multiplexor 120B, position I1 of first subqueue
86A is analyzed along with positions I0 and I1 of second and
third subqueues 86B and 86C. Because byte queue 74
maintains each subqueue such that instructions are shifted to
occupy positions I0 and I1 when previous instructions
within the subqueue are dispatched, position I1 is the only
position within first subqueue 86A which may contain the
second instruction in program order which is to be dispatched
during a clock cycle. Similarly, positions I0 and I1
of subqueues 86B and 86C may contain the second instruction
to be dispatched. If a position I0 is selected for dispatch,
the overflow indication of the preceding subqueue is also
analyzed for forming multiplexor selection controls.

For first multiplexor 120C, position I2 of first subqueue
86A is analyzed along with each of the positions of second
and third subqueues 86B and 86C. Similar to the above
discussion, if position I0 of a subqueue is selected, the
overflow indication of the preceding subqueue is analyzed to
determine multiplexor selection controls. Finally, selection
control unit 76 considers the positions of subqueues 86B and
86C to determine selection controls for first multiplexor
120D. Since first multiplexor 120D selects the fourth
instruction in program order, positions within first subqueue
86A are not considered for selection via multiplexor 120D.
Positions within first subqueue 86A store at most the first
three instructions in program order, according to the present
embodiment.
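
The positions analyzed for each first multiplexor, as described above, can be checked against the shifting discipline with a small sketch. The three instruction positions are labeled I0, I1, and I2 in the code. The dictionary `ANALYZED` transcribes the positions named in the text; the helper functions are hypothetical, and the sketch assumes that valid entries in each subqueue are packed toward the first position. Overflow entries are not modeled.

```python
from itertools import product
from typing import Optional, Tuple

SUBQUEUES = ("86A", "86B", "86C")
POSITIONS = ("I0", "I1", "I2")

# Positions examined per first multiplexor, transcribed from the description.
ANALYZED = {
    "120A": {(q, "I0") for q in SUBQUEUES},
    "120B": {("86A", "I1")} | {(q, p) for q in ("86B", "86C") for p in ("I0", "I1")},
    "120C": {("86A", "I2")} | {(q, p) for q in ("86B", "86C") for p in POSITIONS},
    "120D": {(q, p) for q in ("86B", "86C") for p in POSITIONS},
}


def locate(counts: Tuple[int, int, int], n: int) -> Optional[Tuple[str, str]]:
    """Find the n-th instruction (0-based, program order) in the byte queue.

    `counts` gives the number of valid instructions held by each subqueue.
    """
    for name, count in zip(SUBQUEUES, counts):
        if n < count:
            return name, POSITIONS[n]
        n -= count
    return None


def table_covers_all_cases() -> bool:
    """Check that every reachable location is among the analyzed positions."""
    for counts in product(range(4), repeat=3):
        for n, mux in enumerate(("120A", "120B", "120C", "120D")):
            where = locate(counts, n)
            if where is not None and where not in ANALYZED[mux]:
                return False
    return True


covered = table_covers_all_cases()
examined = {mux: len(places) for mux, places in ANALYZED.items()}
```

Under these assumptions, each of the first four instructions in program order always lies within the positions examined for its multiplexor, while first multiplexor 120A examines only three of the nine positions.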

Table 132 illustrates certain advantages of operating byte
queue 74 in the manner described herein. Selection of the
instructions for first multiplexor 120A involves analyzing
only three positions out of the nine positions included within
byte queue 74. If byte queue 74 were implemented as, for
example, a circular buffer, then each of the positions would
have to be considered for dispatch to issue position zero.
Similarly, analysis of only a few issue positions is performed
to select the instruction for first multiplexor 120B. Selection
logic is thereby reduced, allowing for fewer cascaded levels
of logic. A higher operating frequency for microprocessor 10
may thereby be achieved.
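
The reduction in selection logic can be illustrated with a short sketch. The position sets for first multiplexors 120B through 120D follow the description above, where I0, I1, and I2 denote the three positions of each subqueue. The set shown for first multiplexor 120A (position I0 of each subqueue) is an assumption, as are all identifiers in the code.

```python
# Byte queue 74: three subqueues of three positions each (nine positions).
SUBQUEUES = ("86A", "86B", "86C")
POSITIONS = ("I0", "I1", "I2")

# Positions examined when forming the selection controls of each
# first multiplexor. Keys are multiplexors; values map a subqueue to
# the positions within that subqueue that are analyzed.
CANDIDATES = {
    # Assumption: the first instruction in program order occupies
    # position I0 of one of the subqueues.
    "120A": {"86A": ("I0",), "86B": ("I0",), "86C": ("I0",)},
    "120B": {"86A": ("I1",), "86B": ("I0", "I1"), "86C": ("I0", "I1")},
    "120C": {"86A": ("I2",), "86B": POSITIONS, "86C": POSITIONS},
    # Subqueue 86A holds at most the first three instructions, so it
    # is not considered for the fourth instruction in program order.
    "120D": {"86A": (), "86B": POSITIONS, "86C": POSITIONS},
}

TOTAL_POSITIONS = len(SUBQUEUES) * len(POSITIONS)


def positions_examined(mux: str) -> int:
    """Count the byte queue positions analyzed for one multiplexor."""
    return sum(len(positions) for positions in CANDIDATES[mux].values())


for mux in CANDIDATES:
    print(mux, positions_examined(mux), "of", TOTAL_POSITIONS)
```

A circular buffer would, by contrast, require all nine positions to be examined for every multiplexor.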

Table 132 shows the positions analyzed to select
instructions for issue. However, even though an instruction may be
selectable for issue based upon table 132, other factors may
cause a particular instruction not to be issued. For example,
the instruction in position I0 of second subqueue 86B may
be selected by multiplexor 120B according to table 132.
However, the instruction in position I0 of first subqueue 86A
may be an MROM instruction. Because microprocessor 10
dispatches MROM instructions without concurrent issue of
other MROM instructions, the instruction in position I0 of
second subqueue 86B is not selected if it is an MROM
instruction. Other such restrictions may be imposed
depending upon the embodiment of microprocessor 10, and may be
included within the logic of selection control unit 58.

Turning next to FIG. 9, a table 134 is shown depicting the
allowable issue position combinations (i.e. the selections by
second multiplexors 122 under control of selection control
unit 76) according to one embodiment of microprocessor 10.
Each row of table 134 indicates an allowable combination of
instructions, wherein each column is an issue position. A "-"
in an issue position indicates that no instruction is
dispatched to that issue position. An instruction may not be
dispatched to a particular issue position for a variety of
reasons. For example, dispatch restrictions included
according to a particular embodiment of microprocessor 10 may
cause an issue position not to receive an instruction during
a clock cycle. One embodiment of microprocessor 10
restricts concurrent issue of instructions with an MROM
instruction to up to one fast path instruction. Therefore, rows
136 and 138 are allowable combinations in which issue
position two does not receive an instruction. Additionally,
byte queue 74 may contain only a few instructions during a
particular clock cycle (in the case of a fetch miss in
instruction cache 16, for example). Therefore, selection
control unit 76 selects the available instructions and no
instructions are conveyed in the remaining issue positions.
An "F" in an issue position indicates a fast path instruction,
while an "M" indicates an MROM instruction.
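
The restriction stated above can be checked mechanically. The following is a minimal sketch, assuming three issue positions and an encoding of each combination as a string of "F", "M", and "-" characters; the function name and the encoding are illustrative, and the actual rows of table 134 are not reproduced here.

```python
def pairs_mrom_with_fast_path(combination: str) -> bool:
    """Validate one issue combination and report whether it pairs an
    MROM instruction with a fast path instruction.

    `combination` holds one character per issue position:
    "F" (fast path), "M" (MROM), or "-" (no instruction dispatched).
    """
    if set(combination) - set("FM-"):
        raise ValueError(f"unexpected symbol in {combination!r}")
    mrom = combination.count("M")
    fast = combination.count("F")
    if mrom > 1:
        # MROM instructions are not issued concurrently with each other.
        raise ValueError("more than one MROM instruction")
    if mrom == 1 and fast > 1:
        # An MROM instruction issues with at most one fast path instruction.
        raise ValueError("more than one fast path instruction with MROM")
    return mrom == 1 and fast == 1


print(pairs_mrom_with_fast_path("FM-"))
print(pairs_mrom_with_fast_path("FFF"))
```

Under these assumptions, a combination such as "FM-" or "MF-" leaves the last issue position empty, which matches the description of rows 136 and 138.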

Rows 136 and 138 are cases in which the packed state
described above is set, for embodiments employing the set
of allowable issue position combinations represented by
table 134. Other rows in table 134 do not cause the setting
of the packed state.

Turning now to FIG. 10, a first example of operation of
selection control unit 76, first multiplexors 120, and second
multiplexors 122 is shown in accordance with one
embodiment of microprocessor 10. A set of clock cycles is
depicted, separated by dashed horizontal lines. The clock
cycles are numbered 1, 2, and 3. For each clock cycle, the
selections of first multiplexors 120 are shown via a block
140A-140D, and the selections of second multiplexors 122
are shown via a block 142A-142C. Instructions within a
block are listed from right to left in program order (i.e.
instruction F1 in block 140A is prior to the other instructions
in program order). Instructions are represented by an "F" for
fast path instructions or an "M" for MROM instructions. A
subscript is used to identify different fast path and MROM
instructions.

During clock cycle 1, a set of potentially dispatchable
instructions is selected via first multiplexors 120 (block
140A). It is noted that instructions F2 and F3 are shown in
block 140A to depict instructions subsequent to instruction
M1. However, selection control unit 76 may not actually
select instructions F2 and F3 since F1 and M1 comprise an
allowable combination as shown in table 134. Alternatively,
instructions F2 and F3 may be selected by first multiplexors
120C and 120D, but may not be selected by second
multiplexors 122. Still further, the instructions may be routed
through first multiplexors 120 and second multiplexors 122,
but may be indicated to be invalid such that subsequent
stages of the instruction processing pipeline ignore the
instructions. As shown in block 142A, instructions F1 and
M1 are selected by second multiplexors 122A and 122B,
respectively. Second multiplexor 122C does not select a
valid instruction since instructions F1 and M1 are
concurrently dispatched. Additionally, selection control unit 76 sets
the packed state, as shown in clock cycle 2. Since
instructions F1 and M1 are speculatively dispatched concurrently,
instruction M1 is retained within byte queue 74 for potential
redispatch.

During clock cycle 2, instructions M1, F2, F3, and F4 are
selected as a set of potentially dispatchable instructions
(block 140B). Because the packed state is set, the double
dispatch signal from MROM unit 34 is used to select which
instructions from block 140B are dispatched. When the
double dispatch signal is received, one of two possible sets
of instructions is selected. If the double dispatch signal is
deasserted, the instructions are selected as shown in block
142B. For this case, instruction M1 was determined to not be
a double dispatch instruction. Therefore, instruction M1 is
redispatched. Additionally, other instructions are not
concurrently dispatched with instruction M1. Alternatively, the
double dispatch signal may be asserted, resulting in the
instruction selection shown in block 142C. Instruction M1 is
not redispatched. Instead, instructions subsequent to M1 are
dispatched (i.e. instructions F2, F3, and F4). Blocks 140C
and 140D depict instructions selected as potentially
dispatchable instructions during clock cycle 3 for the cases
represented by blocks 142B and 142C, respectively.

Turning now to FIG. 11, a second example of operation of
selection control unit 76, first multiplexors 120, and second
multiplexors 122 is shown in accordance with one
embodiment of microprocessor 10. For each clock cycle, the
selections of first multiplexors 120 are shown via a block
144A-144D, and the selections of second multiplexors 122
are shown via a block 146A-146C. Instructions within a
block are listed from right to left in program order (i.e.
instruction M1 in block 140A is prior to the other
instructions in program order). Instructions are represented by an
"F" for fast path instructions or an "M" for MROM
instructions. A subscript is used to identify different fast path and
MROM instructions.

Clock cycle 1 in FIG. 11 is similar to clock cycle 1 in FIG.
10, except that instructions M1 and F1 are in reverse order
for this example. Therefore, instruction F1 is retained in byte
queue 74 while being dispatched during clock cycle 1.
Instruction M1 is discarded, and the packed state is set for
clock cycle 2. During clock cycle 2, instructions are selected
via first multiplexors 120 as shown in block 144B. Since the
packed state is set, one of two possible sets of instructions
may be selected from the instructions in block 144B. Block
146B shows the instructions selected if the double dispatch
signal is deasserted, while block 146C shows the
instructions selected if the double dispatch signal is asserted.
Blocks 144C and 144D depict instructions selected by first
multiplexors 120 during clock cycle 3 for the cases shown
in blocks 146B and 146C, respectively.

As block 146B of the example of FIG. 11 shows, when the
fast path instruction is the instruction redispatched due to an
unsuccessful concurrent dispatch of an MROM and fast path
instruction, additional instructions may be dispatched as
well. Advantageously, dispatch bandwidth may be
maximized even during clock cycles in which a redispatch is
performed.
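
The second selection phase walked through in FIGS. 10 and 11 can be summarized in a short sketch. It assumes three issue positions and a candidate list of four instructions; the exact contents of blocks 144B through 146C are not given in the text, so the FIG. 11 candidate list used below is an assumption, as are the function and parameter names. Dispatch restrictions other than the packed-state handling are not modeled.

```python
from typing import List, Optional

ISSUE_POSITIONS = 3  # assumption: three issue positions


def select_for_dispatch(candidates: List[str],
                        packed: bool,
                        double_dispatch: bool,
                        retained: Optional[str] = None) -> List[str]:
    """Choose the actually dispatched instructions from the potentially
    dispatchable ones, which are listed foremost in program order first.

    Names beginning with "M" denote MROM instructions and names
    beginning with "F" denote fast path instructions.
    """
    if not packed:
        # No packing in the previous cycle: take the foremost instructions.
        return list(candidates[:ISSUE_POSITIONS])
    if retained is None:
        raise ValueError("packed state requires a retained instruction")
    if double_dispatch:
        # Packing succeeded: the retained instruction is not dispatched again.
        remaining = [c for c in candidates if c != retained]
        return remaining[:ISSUE_POSITIONS]
    if retained.startswith("M"):
        # Packing failed and the MROM instruction was second in program
        # order: it is redispatched without other instructions (FIG. 10).
        return [retained]
    # Packing failed and the fast path instruction was second: it is
    # redispatched and later instructions may accompany it (FIG. 11).
    return list(candidates[:ISSUE_POSITIONS])


fig10 = ["M1", "F2", "F3", "F4"]
fig11 = ["F1", "F2", "F3", "F4"]
print(select_for_dispatch(fig10, True, False, retained="M1"))
print(select_for_dispatch(fig10, True, True, retained="M1"))
print(select_for_dispatch(fig11, True, False, retained="F1"))
print(select_for_dispatch(fig11, True, True, retained="F1"))
```

The asymmetry between the two failed-packing cases is what allows dispatch bandwidth to be preserved when the redispatched instruction is a fast path instruction.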

Turning now to FIG. 12, a computer system 200 including
microprocessor 10 is shown. Computer system 200 further
includes a bus bridge 202, a main memory 204, and a
plurality of input/output (I/O) devices 206A-206N. Plurality
of I/O devices 206A-206N will be collectively referred to as
I/O devices 206. Microprocessor 10, bus bridge 202, and
main memory 204 are coupled to a system bus 208. I/O
devices 206 are coupled to an I/O bus 210 for
communication with bus bridge 202.

Bus bridge 202 is provided to assist in communications
between I/O devices 206 and devices coupled to system bus
208. I/O devices 206 typically require longer bus clock
cycles than microprocessor 10 and other devices coupled to
system bus 208. Therefore, bus bridge 202 provides a buffer
between system bus 208 and input/output bus 210.
Additionally, bus bridge 202 translates transactions from
one bus protocol to another. In one embodiment,
input/output bus 210 is an Enhanced Industry Standard
Architecture (EISA) bus and bus bridge 202 translates from the
system bus protocol to the EISA bus protocol. In another
embodiment, input/output bus 210 is a Peripheral
Component Interconnect (PCI) bus and bus bridge 202 translates
from the system bus protocol to the PCI bus protocol. It is
noted that many variations of system bus protocols exist.
Microprocessor 10 may employ any suitable system bus
protocol.

I/O devices 206 provide an interface between computer
system 200 and other devices external to the computer
system. Exemplary I/O devices include a modem, a serial or
parallel port, a sound card, etc. I/O devices 206 may also be
referred to as peripheral devices. Main memory 204 stores
data and instructions for use by microprocessor 10. In one
embodiment, main memory 204 includes at least one
Dynamic Random Access Memory (DRAM) and a DRAM
memory controller.

It is noted that although computer system 200 as shown in
FIG. 12 includes one bus bridge 202, other embodiments of
computer system 200 may include multiple bus bridges 202
for translating to multiple dissimilar or similar I/O bus
protocols. Still further, a cache memory for enhancing the
performance of computer system 200 by storing instructions
and data referenced by microprocessor 10 in a faster
memory storage may be included. The cache memory may
be inserted between microprocessor 10 and system bus 208,
or may reside on system bus 208 in a "lookaside"
configuration.

It is noted that, although double dispatch MROM
instructions are described above with respect to dispatching
microcode instructions concurrently with directly-decoded
instructions, the present discussion applies to many different
configurations. For example, for a microprocessor having
four issue positions, microcode instructions which are
parsed into two or three simpler instructions may be
concurrently dispatched with two or one directly-decoded
instructions, respectively. Such an embodiment might select
a microcode instruction and two directly-decoded
instructions for concurrent dispatch. If the microcode instruction is
parsed into two instructions, then the concurrent dispatch is
successful. If the microcode instruction is parsed into three
instructions, then the concurrent dispatch of one of the
directly-decoded instructions may be successful. Redispatch
of one of the concurrently dispatched instructions occurs. If
the microcode instruction is parsed into four or more
instructions, then the concurrent dispatch is unsuccessful
and redispatch of two of the concurrently dispatched
instructions occurs. Similarly, additional issue positions may be
added with extensions to the number of possible concurrent
dispatches and the number of redispatch scenarios. Any
number of issue positions may be employed within various
embodiments. Still further, although microcode instructions
are divided depending upon the use of two issue positions or
more than two issue positions, any division may be used.
Continuing the four issue position example, microcode
instructions may be specified as three dispatch or more than
three dispatch. Microcode instructions which actually use
two issue positions would waste an issue position, but the
number of redispatch scenarios is decreased. Still further, if
a microcode instruction occupies each of the issue positions
for several clock cycles, but the last clock cycle of
instruction issue does not occupy each of the available issue
positions, directly-decoded instructions subsequent to the
microcode instruction may be dispatched during the last
clock cycle of instruction issue by MROM unit 34 in
response to the microcode instruction. In this case, a number
of issue positions occupied is passed to selection control unit
76. Selection control unit 76 determines the number of
instructions to be redispatched from the number of
instructions selected for dispatch and the number of issue positions
occupied by the instructions issued by MROM unit 34.
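
The redispatch arithmetic of the four issue position example can be written out as a small sketch. The function and parameter names are hypothetical, and the formula is one straightforward reading of the paragraph above rather than a description of the actual logic of selection control unit 76.

```python
def redispatch_count(issue_positions: int,
                     directly_decoded_selected: int,
                     microcode_positions: int) -> int:
    """Number of speculatively dispatched directly-decoded instructions
    that must be redispatched.

    issue_positions           -- total issue positions in the microprocessor
    directly_decoded_selected -- directly-decoded instructions dispatched
                                 together with the microcode instruction
    microcode_positions       -- issue positions actually occupied by the
                                 microcode instruction
    """
    overflow = microcode_positions + directly_decoded_selected - issue_positions
    # Never redispatch more instructions than were speculatively selected.
    return min(directly_decoded_selected, max(0, overflow))


# Four issue positions, one microcode instruction selected together
# with two directly-decoded instructions.
for parsed_into in (2, 3, 4):
    print(parsed_into, redispatch_count(4, 2, parsed_into))
```

With these inputs the sketch yields zero, one, and two redispatched instructions for a microcode instruction parsed into two, three, and four simpler instructions, in line with the cases described above.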

Although the x86 microprocessor architecture and 
instruction set have been used as a specific example herein, 
it is noted that the apparatus and method described herein 
may be applicable to any microprocessor which employs 
microcode and directly-decoded instructions. Such embodi- 
ments are contemplated. 

It is still further noted that the present discussion may
refer to the assertion of various signals. As used herein, a
signal is "asserted" if it conveys a value indicative of a
particular condition. Conversely, a signal is "deasserted" if
it conveys a value indicative of a lack of a particular
condition. A signal may be defined to be asserted when it
conveys a logical zero value or, conversely, when it conveys
a logical one value. Additionally, various values have been
described as being discarded in the above discussion. A
value may be discarded in a number of manners, but
generally involves modifying the value such that it is
ignored by logic circuitry which receives the value. For
example, if the value comprises a bit, the logic state of the
value may be inverted to discard the value. If the value is an
n-bit value, one of the n-bit encodings may indicate that the
value is invalid. Setting the value to the invalid encoding
causes the value to be discarded. Additionally, an n-bit value
may include a valid bit indicative, when set, that the n-bit
value is valid. Resetting the valid bit may comprise
discarding the value. Other methods of discarding a value may be
used as well.
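
As an illustration of the valid bit approach, the following minimal sketch models an n-bit value carrying a valid bit. The class and function names are hypothetical and stand in for hardware behavior only loosely.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TaggedValue:
    bits: int           # the n-bit payload
    valid: bool = True  # valid bit; when set, the payload is valid


def discard(value: TaggedValue) -> TaggedValue:
    """Discard a value by resetting its valid bit; the payload is unchanged."""
    return replace(value, valid=False)


def receive(value: TaggedValue) -> Optional[int]:
    """Receiving logic ignores any value whose valid bit is reset."""
    return value.bits if value.valid else None


instruction = TaggedValue(bits=0b1011)
print(receive(instruction))
print(receive(discard(instruction)))
```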

Table 1 below indicates fast path, double dispatch, and 
MROM instructions for one embodiment of microprocessor 
10 employing the x86 instruction set: 



TABLE 1

x86 Fast Path, Double Dispatch, and MROM Instructions

X86 Instruction    Instruction Category
---------------    --------------------
AAA                MROM
AAD                MROM
AAM                MROM
AAS                MROM
ADC                fast path
ADD                fast path
AND                fast path
ARPL               MROM
BOUND              MROM
BSF                fast path
BSR                fast path
BSWAP              MROM
BT                 fast path
BTC                fast path
BTR                fast path
BTS                fast path
CALL               fast path/double dispatch
CBW                fast path
CWDE               fast path
CLC                fast path
CLD                fast path
CLI                MROM
CLTS               MROM
CMC                fast path
CMP                fast path
CMPS               MROM
CMPSB              MROM
CMPSW              MROM
CMPSD              MROM
CMPXCHG            MROM
CMPXCHG8B          MROM
CPUID              MROM
CWD                MROM
CWQ                MROM
DDA                MROM
DAS                MROM
DEC                fast path
DIV                MROM
ENTER              MROM
HLT                MROM
IDIV               MROM
IMUL               double dispatch
IN                 MROM
INC                fast path
INS                MROM
INSB               MROM
INSW               MROM
INSD               MROM
INT                MROM
INTO               MROM
INVD               MROM
INVLPG             MROM
IRET               MROM
IRETD              MROM
Jcc                fast path
JCXZ               double dispatch
JECXZ              double dispatch
JMP                fast path
LAHF               fast path
LAR                MROM
LDS                MROM
LES                MROM
LFS                MROM
LGS                MROM
LSS                MROM
LEA                fast path
LEAVE              double dispatch
LGDT               MROM
LIDT               MROM
LLDT               MROM
LMSW               MROM
LODS               MROM
LODSB              MROM
LODSW              MROM
LODSD              MROM
LOOP               double dispatch
LOOPcond           MROM
LSL                MROM
LTR                MROM
MOV                fast path
MOVCC              fast path
MOV.CR             MROM
MOV.DR             MROM
MOVS               MROM
MOVSB              MROM
MOVSW              MROM
MOVSD              MROM
MOVSX              fast path
MOVZX              fast path
MUL                double dispatch
NEG                fast path
NOP                fast path
NOT                fast path
OR                 fast path
OUT                MROM
OUTS               MROM
OUTSB              MROM
OUTSW              MROM
OUTSD              MROM
POP                double dispatch
POPA               MROM
POPAD              MROM
POPF               MROM
POPFD              MROM
PUSH               fast path/double dispatch
PUSHA              MROM
PUSHAD             MROM
PUSHF              fast path
PUSHFD             fast path
RCL                MROM
RCR                MROM
ROL                fast path
ROR                fast path
RDMSR              MROM
REP                MROM
REPE               MROM
REPZ               MROM
REPNE              MROM
REPNZ              MROM
RET                double dispatch
RSM                MROM
SAHF               fast path
SAL                fast path
SAR                fast path
SHL                fast path
SHR                fast path
SBB                fast path
SCAS               double dispatch
SCASB              MROM
SCASW              MROM
SCASD              MROM
SETcc              fast path
SGDT               MROM
SIDT               MROM
SHLD               MROM
SHRD               MROM
SLDT               MROM
SMSW               MROM
STC                fast path
STD                fast path
STI                MROM
STOS               MROM
STOSB              MROM
STOSW              MROM
STOSD              MROM
STR                MROM
SUB                fast path
TEST               fast path
VERR               MROM
VERW               MROM
WBINVD             MROM
WRMSR              MROM
XADD               MROM
XCHG               MROM
XLAT               fast path
XLATB              fast path
XOR                fast path



Note: Instructions including an SIB byte are also considered
double dispatch instructions.
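
A lookup over Table 1 might be sketched as follows. Only a handful of entries are copied from the table, the names `CATEGORY` and `classify` are hypothetical, and the handling of the SIB byte note assumes that the note applies to instructions which would otherwise be fast path; the text does not state this explicitly.

```python
# A few entries copied from Table 1; the remaining rows would be
# entered in the same way.
CATEGORY = {
    "ADD": "fast path",
    "MOV": "fast path",
    "IMUL": "double dispatch",
    "LEAVE": "double dispatch",
    "DIV": "MROM",
    "POPA": "MROM",
}


def classify(mnemonic: str, has_sib_byte: bool = False) -> str:
    """Return the dispatch category of an instruction.

    Assumption: an SIB byte turns an otherwise fast path instruction
    into a double dispatch instruction and leaves other categories
    unchanged.
    """
    category = CATEGORY[mnemonic.upper()]
    if has_sib_byte and category == "fast path":
        return "double dispatch"
    return category


print(classify("add"))
print(classify("add", has_sib_byte=True))
print(classify("div"))
```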



It is noted that a superscalar microprocessor in accordance
with the foregoing may further employ the latching
structures as disclosed within the co-pending, commonly
assigned patent application entitled "Conditional Latching
Mechanism and Pipelined Microprocessor Employing the
Same", Ser. No. 08/400,608 filed Mar. 8, 1995, by Pflum et
al. The disclosure of this patent application is incorporated
herein by reference in its entirety.

It is further noted that aspects regarding array circuitry
may be found in the co-pending, commonly assigned patent
application entitled "High Performance Ram Array Circuit
Employing Self-Time Clock Generator for Enabling Array
Access", Ser. No. 08/473,103 filed Jun. 7, 1995 by Tran. The
disclosure of this patent application is incorporated herein
by reference in its entirety.

It is additionally noted that other aspects regarding
superscalar microprocessors may be found in the following
co-pending, commonly assigned patent applications:
"Linearly Addressable Microprocessor Cache", Ser. No. 08/146,381,
filed Oct. 29, 1993 by Witt; "Superscalar Microprocessor
Including a High Performance Instruction Alignment
Unit", Ser. No. 08/377,843, filed Jan. 25, 1995 by Witt, et al;
"A Way Prediction Structure", Ser. No. 08/522,181, filed
Aug. 31, 1995 by Roberts, et al; "A Data Cache Capable of
Performing Store Accesses in a Single Clock Cycle", Ser.
No. 08/521,627, filed Aug. 31, 1995 by Witt, et al; "A
Parallel and Scalable Instruction Scanning Unit", Ser. No.
08/475,400, filed Jun. 7, 1995 by Narayan; and "An
Apparatus and Method for Aligning Variable-Byte Length
Instructions to a Plurality of Issue Positions", Ser. No.
08/582,473, filed Jan. 2, 1996 by Narayan, et al. The
disclosures of these patent applications are incorporated
herein by reference in their entirety.

In accordance with the above disclosure, a method and
apparatus for concurrently dispatching microcode
instructions and directly-decoded instructions is provided. The
microcode and directly-decoded instructions are
speculatively dispatched under the assumption that the microcode
instruction occupies a fixed, predetermined number of issue
positions. The predetermined number of issue positions is
less than the total number of issue positions available within
the microprocessor. If the microcode instruction is found to
occupy a larger number of issue positions, then one or more
of the concurrently dispatched instructions are redispatched
in the subsequent clock cycle. Advantageously, dispatch
bandwidth is increased during clock cycles in which a
microcode instruction occupying the predetermined fixed
number of issue positions is concurrently dispatched along
with additional instructions.

Numerous variations and modifications will become 
apparent to those skilled in the art once the above disclosure 
is fully appreciated. It is intended that the following claims 
be interpreted to embrace all such variations and
modifications.

What is claimed is: 

1. A microprocessor comprising:

an instruction cache configured to store instructions;

an instruction alignment unit coupled to receive a
plurality of instructions fetched from said instruction cache,
said plurality of instructions including a
directly-decoded instruction and a microcode instruction,
wherein said instruction alignment unit is configured to
select a first dispatch plurality of instructions including
said directly-decoded instruction and said microcode
instruction, and wherein said instruction alignment unit
is configured to select said first dispatch plurality of
instructions from said plurality of instructions; and

a microcode unit coupled to receive said microcode 
instruction from said instruction cache, wherein said 
microcode unit is configured to determine a number of 
directly-decoded instructions corresponding to said 
microcode instruction, and wherein said microcode unit 
is configured to transmit a signal indicative of said 
number of directly-decoded instructions; 

wherein said instruction alignment unit is coupled to
receive said signal from said microcode unit, and
wherein said instruction alignment unit is configured to
determine if said first dispatch plurality of instructions
is concurrently dispatchable in response to said signal,
and wherein said instruction alignment unit is
configured to discard one of said microcode instruction and
said directly-decoded instruction from said first
dispatch plurality of instructions in response to
determining that said first dispatch plurality of instructions is not
concurrently dispatchable.

2. The microprocessor as recited in claim 1 further
comprising a plurality of decode units coupled to receive
said first dispatch plurality of instructions from said
instruction alignment unit, wherein said plurality of decode units
are configured to decode directly-decoded instructions.

3. The microprocessor as recited in claim 1 wherein said
instruction alignment unit is further configured to retain said
one of said microcode instruction and said directly-decoded
instruction discarded from said first dispatch plurality of
instructions, and wherein said instruction alignment unit is
further configured to subsequently dispatch said one of said
microcode instruction and said directly-decoded instruction.

4. The microprocessor as recited in claim 3 wherein said 
one of said microcode instruction and said directly-decoded 
instruction is a second one in program order of said
microcode instruction and said directly-decoded instruction.

5. The microprocessor as recited in claim 4 wherein said
instruction alignment unit is configured to select a second
dispatch plurality of instructions if said second one in
program order is said directly-decoded instruction, and
wherein said directly-decoded instruction is included within
said second dispatch plurality of instructions.

6. The microprocessor as recited in claim 4 wherein said
instruction alignment unit is configured to dispatch said
microcode instruction individually if said second one of said
microcode instruction and said directly-decoded instruction
is said microcode instruction.

7. The microprocessor as recited in claim 3 wherein said
instruction alignment unit comprises an instruction queue
configured to store said plurality of instructions fetched from
said instruction cache, and wherein said instruction
alignment unit is configured to retain said one of said microcode
instruction and said directly-decoded instruction by
inhibiting deletion of said one of said microcode instruction and
said directly-decoded instruction from said instruction
queue.

8. The microprocessor as recited in claim 1 wherein said
instruction alignment unit comprises a first alignment stage 
and a second alignment stage. 

9. The microprocessor as recited in claim 8 wherein said 
first alignment stage is configured to select said first dispatch 
plurality of instructions.

10. The microprocessor as recited in claim 9 wherein said
second alignment stage is coupled to receive said first
dispatch plurality of instructions from said first alignment
stage, and wherein said second alignment stage is coupled to
receive said signal from said microcode unit and to discard
said one of said microcode instruction and said
directly-decoded instruction from said first dispatch plurality of
instructions.

11. The microprocessor as recited in claim 10 wherein
said first alignment stage is coupled to receive said signal
from said microcode unit and to redispatch said one of said
microcode instruction and said directly-decoded instruction,
wherein said one of said microcode instruction and said
directly-decoded instruction is a second one in program
order of said microcode instruction and said
directly-decoded instruction.

12. The microprocessor as recited in claim 1 wherein said 
instruction cache comprises an instruction scan unit config- 
ured to scan said plurality of instructions fetched from said 
instruction cache, and wherein said instruction scan unit is
configured to detect said microcode instruction and to route 
said microcode instruction to said microcode unit. 

13. A computer system comprising: 
a microprocessor including: 

an instruction cache configured to store instructions; 

an instruction alignment unit coupled to receive a
plurality of instructions fetched from said instruction
cache, said plurality of instructions including a
directly-decoded instruction and a microcode
instruction, wherein said instruction alignment unit
is configured to select a first dispatch plurality of
instructions including said directly-decoded
instruction and said microcode instruction, and wherein
said instruction alignment unit is configured to select
said first dispatch plurality of instructions from said
plurality of instructions; and

a microcode unit coupled to receive said microcode
instruction from said instruction cache, wherein said
microcode unit is configured to determine a number
of directly-decoded instructions corresponding to
said microcode instruction, and wherein said
microcode unit is configured to transmit a signal indicative
of said number of directly-decoded instructions;

wherein said instruction alignment unit is coupled to
receive said signal from said microcode unit, and
wherein said instruction alignment unit is configured to
determine if said first dispatch plurality of instructions
is concurrently dispatchable in response to said signal,
and wherein said instruction alignment unit is
configured to discard one of said microcode instruction and
said directly-decoded instruction from said first
dispatch plurality of instructions in response to
determining that said first dispatch plurality of instructions is not
concurrently dispatchable; and

an input/output (I/O) device coupled to said
microprocessor and to another computer system, wherein said I/O
device is configured to communicate between said
computer system and said another computer system.

14. The computer system as recited in claim 13 wherein 
said I/O device comprises a modem. 

15. A method for dispatching instructions in a 
microprocessor, the method comprising: 

speculatively packing a microcode instruction and a
directly-decoded instruction into a first dispatch
plurality of instructions for dispatch to a plurality of
decode units;

determining a number of directly-decoded instructions 
corresponding to said microcode instruction; 

determining if said first dispatch plurality of instructions
are concurrently dispatchable responsive to said
determining a number of directly-decoded instructions
corresponding to said microcode instruction;

discarding one of said microcode instruction and said
directly-decoded instruction from said first dispatch
plurality of instructions responsive to said determining
if said first dispatch plurality of instructions are
concurrently dispatchable; and

dispatching said first dispatch plurality of instructions 
subsequent to said discarding. 

16. The method as recited in claim 15 further comprising 
redispatching said one of said microcode instruction and 
said directly-decoded instruction responsive to said
discarding.

17. The method as recited in claim 15 wherein said 
discarding comprises discarding said microcode instruction 
if said microcode instruction is subsequent to said directly- 
decoded instruction in program order. 

18. The method as recited in claim 15 wherein said 
discarding comprises discarding said directly-decoded
instruction if said directly-decoded instruction is subsequent 
to said microcode instruction in program order. 

19. The method as recited in claim 15 wherein said 
determining if said first dispatch plurality of instructions are 
concurrently dispatchable comprises determining if a sum of
said number of said directly-decoded instructions
corresponding to said microcode instruction and a number of
remaining ones of said first dispatch plurality of instructions
is less than a number of said plurality of decode units.

* * * * *